Comment Threads and Computer Science Fundamentals

2022, April 30

A personal side-project of mine has turned out to have unexpected technical depth, and provide a good teachable example of breadth-first-search, depth-first-search, and recursion.

Recently I volunteered to preserve a WordPress blog by translating it into a format viewable in one’s own computer and easily host-able without depending on server-side functionality. I had done that for my two WordPress blogs years ago, and it had not been hard.

But this blog had a lot of content in the form of long comment threads. A very active community had been having discussions over fifteen years. The comments should be preserved, and should be viewable as threaded comments.

The process I wanted to follow was:

Export the site from the WordPress admin panel, producing complete dumps in WordPress’s custom XML format, which includes comments.
Use one of the available tools to convert the XML files into Markdown files with Hugo- or Jekyll-style metadata.
Generate the HTML files and the site directory structure using hugo.

But for step 2, none of the available tools did what I needed with the comments. The existing conversion tools either discarded the comments, or collected them and deferred their processing to an external—and externally hosted—tool such as disqus. For this project, though, comments should be statically generated and locally rendered. No run-time dependencies were wanted.¹

After reading the source code of several existing tools to convert WordPress XML², I decided to write my own from scratch in Go. The initial version which did the job I needed is finished, and I have made it available open-source… to add to the pool of incomplete but useful tools in this problem space: migrate-wp-with-comments.

The procedures to parse and render comment threads turns out to be a good teachable example for basic search techniques, and for recursion. Let’s dive in.

The WordPress XML export contains a list of comments for each post or page.

Each comment is given a unique ID, and also has a ParentID field. Comments that were made directly on the post have a ParentID of 0. Comments made in response to a comment, say to the comment with ID 5, will all have their ParentID set to 5.

We can draw a family tree of comments:

Notice the arrows point each comment to the comment it is responding to. And notice the match between a comment’s ID, and the ParentID of the comments responding to it.

With threaded comments, we would like to render this family tree visible. In HTML we could use unordered lists to display the comment tree. A given comment would have its own content, plus a list of children comments, indented as usual list items in HTML, each of which might have its own children… Given a good representation of the comment tree in memory, generating the HTML would be easy then.

How do we get from the exported flat-list of comments to the comment tree?

If you had each comment printed out in an index card, you may well think of the following idea to build the comment tree by hand:

Start with the “root” comments, i.e. those with ParentID == 0. We will make a list called comments to process with them. Let’s place them as a column on the left.
For each comment in the comments to process list, find its children among the list of exported comments. E.g. given a comment with ID == 5, find all comments with ParentID == 5. Once we have found the children, add a link to them from the comment (e.g. by adding them to an array of children) and remove the comment (the one with with ID == 5) from the comments to process list. Add the children to the bottom of the comments to process list.
Repeat step 2 until there are no more comments to process.

So, we go generation by generation. We add all the root comments, then all the children comments, then all the grandchildren. That is, this is basically breadth-first-search.

Let’s show this with an example: suppose we want to build the following comment tree which is made of three branches or sub-trees.

In the figures below, we show the comments to process in light blue.

Even if we’re not searching but building, even if we don’t go through all the details you’d find in an algorithms book³, we’d still need a queue—a FIFO if you prefer that term—to keep track of the comments to process, and the generation-by-generation processing is exactly that of BFS. This is bona fide BFS.

This breadth-first strategy has a peculiarity: we may have a bunch of incomplete comment threads until the very end of the algorithm. In our example, we completed two comment threads in the last iteration, out of three.

We can use a depth-first approach instead, which puts the onus on completing the comment sub-tree in progress before moving on to the next:

We begin, as in our previous strategy, with our root comments, but this time, the list of comments to process has a head we’ll keep track of.
Take the head comment. Find its children and add them to the top of the list of comments to process. Pick one of them as the new head.
If in step 2 the head had no more children to add, remove it from the list of comments to process. The next comment in comments to process will become the new head.
Repeat from step 2 until there are no comments to process.

In prose, and even written as code, the difference between this depth-first strategy and our previous breadth-first one is subtle. But the difference is apparent visually. For the same target comment tree as before, we show the steps. Notice the head comment is shown in dark blue, the other comments to process are shown in light blue:

expand branch 1

Notice that we are not doing any processing on branches 2 and 3, and are trying to complete branch 1 as soon as possible. We first had the root of branch 1 as head, then its first child is the new head. But, as it has no children, we can take it off the comments to process list, and mark it as completed in light green. Its next sibling is the new head.

None of the three children comments of comment 1 have any children, and so they don’t require further processing, and are marked in green. When the three of them are done, the initial head, comment 1, has no unprocessed children, so we can remove it from the list of comments to process. We mark it in green in the next image, where we already show progress into the processing of branch 2.

As before, the algorithm we’ll use is not exactly the textbook-example of depth-first-search, but for all practical purposes it is depth-first-search. To implement this we would need to use a stack—or a LIFO if you prefer that term.

I find this new algorithm less “natural” to do by hand than the depth-first approach. But it has two very interesting properties:

It completes sub-trees / branches as soon as possible. We did not start processing branch 2 until we completed branch 1.
It is self-similar. Notice the similarity between our last two figures. The process as seen from the only child of comment 2 rhymes with the process seen from comment 2.

In other words: this algorithm is recursive. Recursion is a concept that some beginners in computer science, and even mathematics, find difficult. I have always found that surprising, given that we have very familiar (pun intended) examples of recursion: family trees. I think it’s pretty clear that my ancestry tree is the merger of my mother’s ancestry tree, and my father’s. And, a marriage’s descendants are the children plus the children’s descendants.

For our concrete example: the comment tree we’re after is the combination of branches 1, 2, and 3. And, sub-tree 2 is the concatenation of comment 2, and the sub-tree rooted at comment 2’s child.

In fact, a significant advantage of depth-first-search over breadth-first-search is that the algorithm lends itself so well to recursion. Taking advantage of recursive calls, we don’t need to build our own stack, which we would have to otherwise.

We can formulate the following pseudo-code to represent depth-first comment tree compilation:

treeRootedAt(comment):
    for each <child> of <comment>:
        new-subtree = treeRootedAt(child)
        attach <new-subtree> to <comment>

In fact, in my Go code I have a function that is pretty much this, except for the explicit use of a map to find children quickly.

func threadCommentLevel(
    node comment, commentsWithParentID map[int][]comment,
) commentThread {
    threads := make([]commentThread, len(commentsWithParentID[node.ID]))
    for i, c := range commentsWithParentID[node.ID] {
        threads[i] = threadCommentLevel(c, commentsWithParentID)
    }
    return commentThread{
        comment: node,
        Children: threads,
    }
}

This has been a fun side project for me to work on. I was able to preserve the original WordPress site using my tool and creating a new hugo project from the exported Markdown / HTML.
I had never considered how comment threads could be generated and displayed. It’s one of the many things that most of us leave to external services, in this case blog-building websites, or for email, clients with message threading capabilities.

It is interesting to have a good reason to look under the hood of some of those services we all rely on.
Sometimes it is hard to motivate beginners in programming to learn the fundamentals. I think comment threading would make a very good project for beginners, even a good class project. On top of that, it is a case in point of the importance of data structures. We have mentioned trees, queues and stacks.
You know that old quote from Fred Brooks:

Show me your flow charts and conceal your tables and I shall continue to be mystified, show me your tables and I won’t usually need your flow charts; they’ll be obvious

Without run-time-dependencies, it is not possible for readers of the blog to add new comments. This was fine for my project, as preservation of the site was sought. For addition of comments to a live website, some form of database back-end is needed. A problem that is taken care of by blog engines like WordPress or comment engines like disqus.
An extremely interesting idea that is in a way cleaner for techies is Staticman, which leverages the GitHub account you may well have and co-opts it as a storage backend. Comments become Pull Requests! ↩︎
Two projects in particular were inspirations: wordpress-export-to-markdown (javascript) wordpressxml2jekyll.rb (ruby) ↩︎
like Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001), Introduction to Algorithms (2nd ed.), MIT Press and McGraw-Hill ↩︎

<< Newer	Go is a great Language for Object Oriented Programming
Older >>	The Chosen One - recruiting