04 August 2020

Directed Acyclic Graphs

Wait, what?

Why start with Graphs at all, and even then a subset of them? Because, quoting Tony Stark, "sometimes you gotta run before you can walk."

Intro

As mentioned, Directed Acyclic Graphs (or DAGs) are a subset of the graph data structure.

They differ from conventional graphs:

• DIRECTED: The edges have a direction associated with them.
• ACYCLIC: It has no directed cycles. Simply put, the same node may not be visited again.

Use

DAGs are used extensively for workflows, particularly those involving data. Indeed the reason I had a use for them was for our upcoming ETL platform.

It must be mentioned here that I was first introduced to the concept of DAGs while I was implementing Apache Airflow.

Implement

We will be using the networkx library.

\$ pip install networkx

Since, its a ETL pipeline, let's start simple. Two input tables joined with a join and pushing data to another output table.

>>> import networkx as nx

>>> graph = nx.DiGraph()

... ("table_1", "join"),
... ("table_2", "join"),
... ])

Note: DiGraph is Directed Graph.

Visualization

It is a major assistance to be able to visualize the graphs that we create so that we know we are on the right track.

Visualization can be done with matplotlib.

\$ pip install matplotlib
>>> import networkx as nx
>>> from matplotlib import pyplot

>>> graph = nx.DiGraph()

... ("input_table_1", "join"),
... ("input_table_2", "join"),
... ])

>>> pyplot.clf()
>>> pyplot.tight_layout()
>>> nx.draw_networkx(graph, arrows=True)
>>> pyplot.savefig("graph.png", format="PNG")
>>> pyplot.clf()

A PNG file will be created in the working directory. More Nodes

... ("input_table_1", "join"),
... ("input_table_2", "join"),
... ("join", "output_table_1"),
... ]) Validation

As the number of nodes and edges keep on increasing, it will be difficult to keep track of the graph graphically (pun unintentional). For times as these, one can:

• check that the graph is indeed directed.
>>> nx.is_directed(graph)
True
• check that the graph is directed AND acyclic.
>>> nx.is_directed_acyclic_graph(graph)
True

There are a ton of other functions on this library. Join me next time when I discuss topological ordering and why it is needed.