Choosing the Right Data Structures

August 2013

About a year ago, I was using NetworkX (a Python package for studying graphs) for one of my projects. While I was working on it, I found that the topological sort implementation in NetworkX was very slow on large-scale instances. So I decided to take a look at it, hoping I can figure out what's wrong and fix it. In the end, I sped up considerably the implementation. Here is how I did it.

Please note that you don't need to understand how topological sort works to follow the rest of the article as this is just a pretext to realize that it's important to choose the right data structures when implementing algorithms.

Below is the original implementation of topological_sort_recursive in NetworkX.

def topological_sort_recursive(G, nbunch=None):
    if not G.is_directed():
        raise nx.NetworkXError("Topological sort not defined on undirected graphs.")

    # function for recursive dfs
    def _dfs(G, seen, explored, v):
        seen.add(v)
        for w in G[v]:
            if w not in seen:
                if not _dfs(G, seen, explored, w):
                    return False
            elif w in seen and w not in explored:
                # cycle Found--- no topological sort
                raise nx.NetworkXUnfeasible("Graph contains a cycle.")
        explored.insert(0, v) # inverse order of when explored
        return True

    seen = set()
    explored = []

    if nbunch is None:
        nbunch = G.nodes_iter()
    for v in nbunch:  # process all nodes
        if v not in explored:
            if not _dfs(G, seen, explored, v):
                raise nx.NetworkXUnfeasible("Graph contains a cycle.")
    return explored

The main issue in this implementation is that two crucial operations on the variable explored, which is a list of nodes, take linear time instead of taking a constant time if proper data structures were chosen.

The operations w not in explored (lines 12 and 24) and explored.insert(0, v) (line 15) are inefficient on a list because a list in CPython is implemented as an array. This means that both a membership test (in) and inserting at an arbitrary position (insert) usually take linear time. In the first case, one must perform a check on each element and in the second case one must shift all elements to the right by one position to insert an element at the first position. This leads to lots of unnecessary memory copies.

To prevent such situations, one should keep in mind the run-time of the operations on the standard data structures. If this is not familiar to you, this page is a good place to start.

Enough said, let's now look at how to speed up the topological sort implementation.

To fix the in operation, we simply need to change the type of explored to a set. This way, a membership test now takes a constant time because a set is implemented as a hash table in CPython.

Since we need to keep track of the order of the visited nodes (this is the purpose of the insert call explored.insert(0, v) at line 15) and a set is unordered, we must add another variable (order) of type list to which we insert the nodes. But instead of inserting a node at the beginning of the list, we insert it at the end. Appending to a list usually takes a constant time because we only need to insert a new element at the end. Finally, we need to reverse the list at the end of the algorithm to get the same order as in the original implementation.

Here is the new implementation:

def topological_sort_recursive(G, nbunch=None):
    if not G.is_directed():
        raise nx.NetworkXError("Topological sort not defined on undirected graphs.")

    def _dfs(v):
        ancestors.add(v)

        for w in G[v]:
            if w in ancestors:
                raise nx.NetworkXUnfeasible("Graph contains a cycle.")

            if w not in explored:
                _dfs(w)

        ancestors.remove(v)
        explored.add(v)
        order.append(v)

    ancestors = set()
    explored = set()
    order = []

    if nbunch is None:
        nbunch = G.nodes_iter()

    for v in nbunch:
        if v not in explored:
            _dfs(v)

    return list(reversed(order))

I did a similar refactoring for the non recursive implementation and benchmarked my new implementations on a graph with 199,496 nodes and 600,000 arcs. The results are quite impressive:

Implementation	Original time (s)	New time (s)
Recursive	4027.5	0.5
Non-Recursive	15.6	0.6

Before, it took more than a hour to apply the recursive topological sort on this instance and 15.6 seconds for the non-recursive version. It now takes less than a second in both cases.

My changes have been merged into the master branch, and have been released in version 1.8.

Hacker News Discussion

Reddit Discussion

Like this article? Get notified of new ones: