04 August 2020

# Directed Acyclic Graphs

Wait, what?

Why start with Graphs at all, and even then a subset of them? Because, quoting Tony Stark, "sometimes you gotta run before you can walk."

# Intro

As mentioned, Directed Acyclic Graphs (or DAGs) are a subset of the graph data structure.

They differ from conventional graphs:

• DIRECTED: The edges have a direction associated with them.
• ACYCLIC: It has no directed cycles. Simply put, the same node may not be visited again.

# Use

DAGs are used extensively for workflows, particularly those involving data. Indeed the reason I had a use for them was for our upcoming ETL platform.

It must be mentioned here that I was first introduced to the concept of DAGs while I was implementing Apache Airflow.

# Implement

We will be using the `networkx` library.

``\$ pip install networkx``

Since, its a ETL pipeline, let's start simple. Two input `table`s joined with a `join` and pushing data to another output `table`.

``````>>> import networkx as nx

>>> graph = nx.DiGraph()

... ("table_1", "join"),
... ("table_2", "join"),
... ])
``````

Note: DiGraph is Directed Graph.

# Visualization

It is a major assistance to be able to visualize the graphs that we create so that we know we are on the right track.

Visualization can be done with `matplotlib`.

``\$ pip install matplotlib``
``````>>> import networkx as nx
>>> from matplotlib import pyplot

>>> graph = nx.DiGraph()

... ("input_table_1", "join"),
... ("input_table_2", "join"),
... ])

>>> pyplot.clf()
>>> pyplot.tight_layout()
>>> nx.draw_networkx(graph, arrows=True)
>>> pyplot.savefig("graph.png", format="PNG")
>>> pyplot.clf()``````

A PNG file will be created in the working directory. # More Nodes

About that `output` table.

``````>>> graph.add_edges_from([
... ("input_table_1", "join"),
... ("input_table_2", "join"),
... ("join", "output_table_1"),
... ])`````` # Validation

As the number of nodes and edges keep on increasing, it will be difficult to keep track of the graph graphically (pun unintentional). For times as these, one can:

• check that the graph is indeed directed.
``````>>> nx.is_directed(graph)
True``````
• check that the graph is directed AND acyclic.
``````>>> nx.is_directed_acyclic_graph(graph)
True``````

There are a ton of other functions on this library. Join me next time when I discuss topological ordering and why it is needed.