04 August 2020

Data Structures, Volume 1

Directed Acyclic Graphs

Wait, what?

Why start with Graphs at all, and even then a subset of them? Because, quoting Tony Stark, "sometimes you gotta run before you can walk."

Intro

As mentioned, Directed Acyclic Graphs (or DAGs) are a subset of the graph data structure.

They differ from conventional graphs:

  • DIRECTED: The edges have a direction associated with them.
  • ACYCLIC: It has no directed cycles. Simply put, the same node may not be visited again.

Use

DAGs are used extensively for workflows, particularly those involving data. Indeed the reason I had a use for them was for our upcoming ETL platform.

It must be mentioned here that I was first introduced to the concept of DAGs while I was implementing Apache Airflow.

Implement

We will be using the networkx library.

$ pip install networkx

Since, its a ETL pipeline, let's start simple. Two input tables joined with a join and pushing data to another output table.

Let's start with the root node and then keep adding other nodes.

>>> import networkx as nx

>>> graph = nx.DiGraph()

>>> graph.add_edges_from([
... ("table_1", "join"),
... ("table_2", "join"),
... ])

Note: DiGraph is Directed Graph.

Visualization

It is a major assistance to be able to visualize the graphs that we create so that we know we are on the right track.

Visualization can be done with matplotlib.

$ pip install matplotlib
>>> import networkx as nx
>>> from matplotlib import pyplot

>>> graph = nx.DiGraph()

>>> graph.add_edges_from([
... ("input_table_1", "join"),
... ("input_table_2", "join"),
... ])

>>> pyplot.clf()
>>> pyplot.tight_layout()
>>> nx.draw_networkx(graph, arrows=True)
>>> pyplot.savefig("graph.png", format="PNG")
>>> pyplot.clf()

A PNG file will be created in the working directory.

More Nodes

About that output table.

>>> graph.add_edges_from([
... ("input_table_1", "join"),
... ("input_table_2", "join"),
... ("join", "output_table_1"),
... ])

Validation

As the number of nodes and edges keep on increasing, it will be difficult to keep track of the graph graphically (pun unintentional). For times as these, one can:

  • check that the graph is indeed directed.
    >>> nx.is_directed(graph) 
    True
  • check that the graph is directed AND acyclic.
    >>> nx.is_directed_acyclic_graph(graph)
    True

There are a ton of other functions on this library. Join me next time when I discuss topological ordering and why it is needed.