
The Directed Acyclic Graph (DAG) - Zhihu

The Directed Acyclic Graph (DAG)

By Peter Wang Guangzhong, programmer and professional blockchain explainer

DAG, the Directed Acyclic Graph, is a technology with a great deal of visibility in the blockchain world. Interestingly, a DAG is not actually a blockchain, as we will see in a moment. This article is only an introduction; for the technical details, consult the well-known DAG-based projects mentioned below.

What DAG means in graph theory

A DAG is a kind of graph, so what is it in graph-theoretic terms? Let's start from the blockchain. If you have a programming background you know the linked list: a chain of nodes, each holding a link to the node before it. A blockchain forms a chain precisely because each new block contains a pointer to the previous block, so its data structure is a linked list. The trouble is that a chain is a single line: if the time to produce a block is fixed, only one block can be appended per interval, which becomes a performance bottleneck. There are roughly two ways to raise throughput: shorten the block interval, or (the route taken by DAG-based projects) change the data structure so that new data can be appended in parallel.

A DAG's data structure is not a linked list but a graph. A graph has many nodes, also called vertices, and the connections between pairs of nodes are called edges; these are the same vertices and edges we met in school mathematics. Linked list, tree, and graph are three data structures of increasing complexity. A linked list is a directed line. A tree has branches, but between any two nodes there is exactly one path, so no closed figure can form. A graph may contain closed figures. (Strictly speaking a tree is a special case of a graph, but we will not dwell on that.)

The last letter, G, stands for Graph. What about D and A? D is for Directed: edges have a definite direction. If node A holds a pointer to node B while B holds no pointer back to A, the drawing shows a single arrow from A to B. In a DAG, every link from one node to another is one-way; that is what "directed" means. A is for Acyclic: nowhere in the graph may you start at a node, follow the arrows, and arrive back where you started. With that, we understand what a DAG is from the standpoint of graph theory.

How blockchains use DAGs

Now let's see how DAGs work in blockchain projects, by asking how the usual blockchain functions are realized on a DAG.

First, transaction validation. The common approach is to require whoever issues a new transaction to select one or more earlier transactions to validate. DAG projects do not package transactions into blocks: each transaction is a node, the pointers between transactions are the edges, and the transaction history forms a DAG.

Second, consensus. Validation alone cannot keep the system safe: if someone issues malicious transactions and disputes arise, who settles them? The basic idea is to select a few dozen witnesses by some rule and have them attest transactions through a BFT protocol or other rules. Byteball elects twelve real-name witnesses; IOTA adopted a temporary centralized Coordinator mechanism.

Third, incentives. DAG projects generally have no mining; the usual practice is to issue all coins at once. Nano, for example, issued its entire supply directly from the genesis account. Machines that participate in validation earn token rewards.

These are the ways DAG-based projects realize the usual functions.

Strengths and weaknesses of the DAG

Finally, the trade-offs. The most obvious strengths are high speed and zero fees. Under a DAG, transactions are not packed into blocks; each transaction is validated individually, so large numbers of validations can proceed in parallel, and in theory transaction throughput has no upper bound. Because users validate one another's transactions, DAG projects can offer zero-fee transactions. These strengths are essential to an Internet-of-Things project like IOTA, which handles machine-to-machine transactions that are typically high-frequency and tiny; charging fees on them would make the whole scheme unworkable.

As for the weaknesses: the security of the DAG approach still has to be proven, and vulnerabilities keep being exposed. Centralization is also a fairly serious problem, as the consensus schemes of the projects above suggest. In short, DAG's advantages are clear, but the approach remains to be validated.

Summary

To sum up this brief introduction: a DAG is a graph in which, starting from any node and following the pointer directions, no choice of path ever leads back to the starting point. Applied to blockchains, a DAG can deliver very high performance and zero fees, but its ultimate feasibility is still unproven.

References:
https://www.youtube.com/watch?v=LtWUJtnQbKs&t=296s
https://en.wikipedia.org/wiki/Directed_acyclic_graph
https://en.wikipedia.org/wiki/Tree_(graph_theory)
https://www.coinbureau.com/education/directed-acyclic-graphs-dags/

Published 2018-12-14.
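The transaction-validation scheme just described, where each new transaction must reference one or more earlier transactions so that history forms a graph rather than a chain, can be made concrete with a small sketch. This is a toy model in Python, not the protocol of any real project; the names (Tx, submit, ledger) are invented for illustration.

import hashlib

class Tx:
    def __init__(self, payload, parents):
        self.payload = payload
        self.parents = parents  # ids of the earlier transactions this one validates
        self.id = hashlib.sha256(
            (payload + "".join(sorted(parents))).encode()
        ).hexdigest()

ledger = {}

def submit(payload, parent_ids):
    # A real system would also check signatures and balances here;
    # this toy only checks that the referenced parents exist.
    assert all(p in ledger for p in parent_ids), "unknown parent transaction"
    tx = Tx(payload, parent_ids)
    ledger[tx.id] = tx
    return tx.id

genesis = submit("genesis", [])
a = submit("Alice->Bob: 5", [genesis])
b = submit("Bob->Carol: 2", [genesis])  # appended in parallel with a
c = submit("Carol->Dan: 1", [a, b])     # validates two earlier transactions

Because edges only ever point from newer transactions to older ones, no path can return to its starting point: the ledger is a DAG by construction.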

Ye Shengchao: Understand DAG (Directed Acyclic Graph) Technology in One Minute! (51) - Zhihu

Ye Shengchao: Understand DAG (Directed Acyclic Graph) Technology in One Minute! (51)

By Ye Shengchao, blockchain enthusiast

What is a DAG? DAG stands for "Directed Acyclic Graph": a graph made of finitely many vertices and directed edges, in which starting from any vertex and following any number of directed edges can never bring you back to that vertex. The Merkle tree (a binary tree) discussed earlier in this series is one kind of directed acyclic graph, and the directed acyclic graph is in turn one kind of directed graph. The best-known DAG project today is IOTA, whose main innovation is the Tangle model.

Against what background was DAG technology born? Public-chain technology has long suffered from slow processing and high fees, and without an efficient public chain the development of the whole blockchain industry is constrained. DAG technology arose in that setting. A DAG system has no miners and no concept of blocks, and with no blocks there is no block-size problem, so compared with traditional blockchain technology it offers faster transactions and stronger scalability.

What are DAG technology's characteristics?

1. Fast transactions. The local processing and parallel settlement a DAG enables can raise transaction speed dramatically.
2. Strong scalability. Because each node need not wait to synchronize other nodes' data, accounting nodes can scale out easily, which makes DAGs a good fit for Internet-of-Things projects.
3. Tampering is harder. Malicious modification is much harder in a DAG than in a chain structure: a DAG node has many out-edges and in-edges, and modifying one node means modifying all of them too.

To summarize: compared with a traditional linear blockchain, the DAG's strengths are speed and throughput. Because a DAG is a graph-shaped network, each node can process transactions asynchronously without waiting for other nodes' data, avoiding the time wasted on waiting; in a chain-shaped network the limit is not the nodes' processing power but the chain's inability to compute in parallel, so the wasted time is mostly waiting time. DAG's application scenarios are not yet as broad as the traditional blockchain's, but its advantages are already showing, and as DAGs spread, DAG projects' experiments should bring more surprises.

(This article belongs to Ye Shengchao's blockchain-basics series; published 2019-09-28.)
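The defining property above, that no sequence of directed edges returns to its starting vertex, is easy to check mechanically with a depth-first search. A minimal sketch in plain Python; the dict-of-lists graph encoding is just one convenient choice.

def has_cycle(graph):
    """graph: dict mapping each vertex to a list of its successors."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on the current DFS path / finished
    color = {v: WHITE for v in graph}

    def dfs(v):
        color[v] = GRAY
        for w in graph.get(v, []):
            if color.get(w, WHITE) == GRAY:  # back edge: the path returned to an ancestor
                return True
            if color.get(w, WHITE) == WHITE and dfs(w):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and dfs(v) for v in graph)

print(has_cycle({"a": ["b"], "b": ["c"], "c": []}))     # False: a DAG
print(has_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))  # True: not a DAG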

Revisiting the Directed Acyclic Graph (DAG) - Tencent Cloud Developer Community

Revisiting the Directed Acyclic Graph (DAG)

By 半吊子全栈工匠, from the column 喔家ArchiSelf; published 2021-08-06.

When we study data structures they feel dry, yet when we face real problems we are often helpless for lack of them. Going from problems to theory and back again, thinking deeply at one point and accumulating practice, may be a clumsy method, but a clumsy method beats no method. This article grew out of the author's working notes on the DAG.

What is a DAG?

DAG: Directed Acyclic Graph. From a computer's viewpoint a DAG is a graph, and a graph, like the array, the queue, or the linked list, is a data structure. A graph is a data structure made of vertices and the edges connecting them; it is one of the most flexible data structures in computer science, and many problems can be modeled and solved with graphs: social networks between people, analyzing a computer network's topology to determine whether two machines can communicate, finding the shortest route between two places in a logistics system, and so on.

A review of graph terminology:

Vertex: a point in the graph.
Edge: a line segment connecting two vertices.
Adjacent: the two endpoint vertices of an edge are adjacent.
Degree: the number of edges incident to a vertex.
Path: the ordered sequence of vertices passed through, along edges, from one vertex to another.
Simple path: a path with no repeated vertices.
Cycle: a path with at least one edge whose start and end are the same vertex.
Simple cycle: a cycle with no repeated vertices or edges.
Acyclic graph: a graph that contains no cycles.
Connected graph: a graph in which a path exists from every vertex to every other vertex.
Sparse graph: every vertex's degree is fairly low.
Dense graph: every vertex's degree is high.

Edges in a graph may carry direction, and the direction of an edge expresses an asymmetric relation between its two vertices; a map application, for example, must store one-way streets to avoid giving wrong directions. If every edge between vertices is a directed edge, the graph is a directed graph. Take a directed graph that is not acyclic, say one in which you can leave A, pass through B and C, and return to A, forming a cycle; reverse the edge from C to A into an edge from A to C and the graph becomes a directed acyclic graph, a DAG.

By the mathematical definition, a DAG is a finite directed graph with no directed cycles: it consists of finitely many vertices and directed edges, each directed edge pointing from one vertex to another, and from no vertex can you follow the directed edges back to where you began. In other words it is built from vertices and edges (also called arcs), and following the edge directions never closes a loop.

The DAG is closely related to the partially ordered set (poset) of algebraic topology: any two graphs with the same partial-order relation have the same set of topological orderings. In fact every DAG corresponds uniquely to a poset and every poset is a DAG, so in essence they are the same thing. DAGs also have pleasant properties; many dynamic-programming problems, for instance, can be recast as longest-path, shortest-path, or path-counting problems on a DAG.

Properties of the DAG

The DAG mixes spatial structure with temporal ordering and is closely related to the tree; its topological sort and shortest-path computations both have distinctive features.

The DAG and the tree

The DAG is a generalization of the tree, and of the polytree as well. A tree is hierarchical, subdividing by category or attribute down to atomic units; a tree is simply an acyclic connected graph. A DAG likewise fans out from a source, but along the way branches may merge and aggregate again. Because a vertex in a directed graph can reach another vertex along two different routes without forming a cycle, a directed acyclic graph cannot always be converted into a tree, but every directed tree is a directed acyclic graph.

Topological sort of a DAG

Topological sorting is an algorithm that can order all the vertices; its ordering criterion is the finishing time of a depth-first search. A topological order can be used to compute single-source shortest paths on a directed acyclic graph, precisely because topological sorting rests on the absence of cycles: the graph contains no back edges.

A topological sort arranges all vertices of the graph in a linear sequence such that for every pair of vertices u and v, if the edge (u, v) ∈ E(G), then u appears before v in the sequence. Such a sequence is called a topological sequence. Put simply, obtaining a total order on a set from a partial order on that set is what we call topological sorting. For a DAG, a vertex order can be fixed as follows: for all u and v, if a directed path u → v exists, then u precedes v in the final ordering. The order so determined is a topological sort of the DAG. Its characteristics are: (1) every vertex u that can reach a vertex v lies before v; (2) every vertex u reachable from v lies after v. Furthermore, only a directed acyclic graph has a topological sort, and a graph's topological order need not be unique.

Topological sorting is implemented mainly in two ways: the in-degree table and DFS. The DFS implementation keeps the topologically sorted vertex sequence on a stack, and guarantees that a vertex is pushed only after all of its adjacent vertices have been pushed. Below is sample code for in-degree-table topological sorting (C++, since it uses the STL queue):

#include <cstring>
#include <queue>

#define MAX_NODE 1000
#define MAX_EDGE_NUM 100000

struct Edge {
    int to;
    int w;
    int next;
};

Edge gEdges[MAX_EDGE_NUM];
int gHead[MAX_NODE];  // must be filled with -1 (memset(gHead, -1, sizeof(gHead))) before edges are inserted
bool gVisited[MAX_NODE];
int gInDegree[MAX_NODE];
int gEdgeCount;

void InsertEdge(int u, int v, int w) {
    int e = gEdgeCount++;
    gEdges[e].to = v;
    gEdges[e].w = w;
    gEdges[e].next = gHead[u];
    gHead[u] = e;
    gInDegree[v]++;  // in-degree of v rises by 1
}

void TopoSort(int n /* number of nodes */) {
    std::queue<int> zero_indegree_queue;
    for (int i = 0; i < n; i++) {
        if (gInDegree[i] == 0)
            zero_indegree_queue.push(i);
    }
    memset(gVisited, false, sizeof(gVisited));
    while (!zero_indegree_queue.empty()) {
        int u = zero_indegree_queue.front();
        zero_indegree_queue.pop();
        gVisited[u] = true;
        // output u here
        for (int e = gHead[u]; e != -1; e = gEdges[e].next) {
            int v = gEdges[e].to;
            gInDegree[v]--;
            if (gInDegree[v] == 0) {
                zero_indegree_queue.push(v);
            }
        }
    }
    for (int i = 0; i < n; i++) {
        if (!gVisited[i]) {
            // a cycle: no topological order exists
        }
    }
}

Single-source shortest paths on a DAG

Single-source shortest paths in a general graph can be computed with Dijkstra, Bellman-Ford, or SPFA; for a directed acyclic graph, a simple dynamic program is enough. What makes the DAG special is that all its nodes can be linearized (the topological sequence) so that every edge runs from left to right. Given a DAG and a source vertex, we can obtain the shortest path from that source to every other vertex. With no negative weights Dijkstra works, but it fails when negative weights exist, and it is not especially efficient in any case; on a DAG the result of a topological sort yields the single-source shortest paths in linear time, O(V+E).

The algorithm:

1) Initialize dist[] = {INF, INF, ...} and dist[s] = 0, where s is the source vertex.
2) Build a topological order of all vertices.
3) For each vertex u in topological order, relax each adjacent vertex v:
   if (dist[v] > dist[u] + weight(u, v)) dist[v] = dist[u] + weight(u, v)

Concretely, a Python implementation of the DAG shortest-path algorithm:

# Python program to find single-source shortest paths
# for directed acyclic graphs. Complexity: O(V+E)
from collections import defaultdict

# The graph is represented as an adjacency list. Every entry holds the
# vertex number the edge connects to, together with the weight of the edge.


class Graph:
    def __init__(self, vertices):
        self.V = vertices               # number of vertices
        self.graph = defaultdict(list)  # dictionary containing the adjacency list

    # add an edge to the graph
    def addEdge(self, u, v, w):
        self.graph[u].append((v, w))

    # recursive helper used by shortestPath
    def topologicalSortUtil(self, v, visited, stack):
        visited[v] = True  # mark the current node as visited
        # recur for all vertices adjacent to this vertex
        if v in self.graph.keys():
            for node, weight in self.graph[v]:
                if not visited[node]:
                    self.topologicalSortUtil(node, visited, stack)
        # push the current vertex onto the stack that stores the topological sort
        stack.append(v)

    def shortestPath(self, s):
        """Find shortest paths from the given vertex, using the recursive
        topologicalSortUtil() to obtain a topological sorting of the graph."""
        visited = [False] * self.V  # mark all vertices as not visited
        stack = []
        # store the topological sort, starting from every unvisited vertex
        for i in range(self.V):
            if not visited[i]:
                self.topologicalSortUtil(i, visited, stack)
        # distances start at infinity, except 0 for the source
        dist = [float("Inf")] * self.V
        dist[s] = 0
        # process the vertices in topological order
        while stack:
            i = stack.pop()  # next vertex in topological order
            # update the distances of all adjacent vertices
            for node, weight in self.graph[i]:
                if dist[node] > dist[i] + weight:
                    dist[node] = dist[i] + weight
        # print the computed shortest distances
        for i in range(self.V):
            print("%d" % dist[i] if dist[i] != float("Inf") else "Inf")


g = Graph(6)
g.addEdge(0, 1, 5)
g.addEdge(0, 2, 3)
g.addEdge(1, 3, 6)
g.addEdge(1, 2, 2)
g.addEdge(2, 4, 4)
g.addEdge(2, 5, 2)
g.addEdge(2, 3, 7)
g.addEdge(3, 4, -1)
g.addEdge(4, 5, -2)

s = 1  # source
print("Following are shortest distances from source %d" % s)
g.shortestPath(s)

# This code is contributed by Neelam Yadav

Applications of the DAG

Given its generality, the DAG has a wide range of application scenarios.

A storage use case. As a data structure, the DAG's most famous storage application is Git, which combines the Merkle tree with the DAG into a composite structure, the Merkle DAG, to implement distributed version control. IPFS drew on Git's design and likewise uses a Merkle DAG as its core data structure, later named IPLD, which sits at the waist of the IPFS ecosystem's hourglass model.

A Merkle tree, often called a hash tree, is, as the name says, a tree that stores hash values. Building on it, the Merkle DAG is a directed acyclic graph that can loosely be understood as a tree freed of the Merkle tree's strict constraints (such as balance). Its characteristics: a parent node's hash is determined by its children's hashes (the parent's hash is the hash of the string formed by concatenating the children's hashes), and a parent node holds links to its children. The Merkle DAG keeps the essence of the Merkle tree: any change to a lower node changes the hashes of the nodes above it, and ultimately the root hash.

When a file is stored in the IPFS network it is first cut into 256 KB chunks, and the file's Merkle DAG is built by repeatedly calling MerkleDAG.Add. The file's hash is produced as follows: run SHA-256 over the chunked file, take positions 0 through 31 of the result, Base58-encode that, and prepend "Qm"; the outcome is the file's 46-character hash. IPFS maintains the Merkle DAG through a DAGService, which provides the rights to add to and delete from it; the Merkle DAG here is a multi-way tree with at most 174 children per node. Through the Merkle DAG, IPFS can easily guarantee and verify the integrity of data shared between computers peer-to-peer, which makes it very useful to the system.

A causal-relationship use case. Regression models built to investigate influencing factors or to control bias require a causal relationship between the covariates and the outcome, so covariates must be screened strictly before entering the model. The DAG can be called the soul of regression analysis, its guiding principle: a causal network is built as a DAG, the tail of an arrow being the cause and its head the effect, and the one-way network formed by all the variables' causal arrows is the DAG from which suitable covariates are chosen. A Bayesian network, for example, is a network of associated probabilistic events: vertices represent events, and the probability of a later event can be computed from the probabilities of its predecessor nodes in the DAG.

A dynamic-programming implementation on the DAG. What is dynamic programming? A problem is decomposed into interrelated stages, a decision is made at every stage, and the whole process is a decision sequence; making the overall effect optimal is a multi-stage decision problem, and dynamic programming is a way of thinking for solving such problems optimally. The process is decomposed by time or space into interconnected stages so that each stage's solution can be found in order. The objective conditions at the start of a stage are its state; once a stage's state is fixed, different decisions can be made, determining the next stage's state. Concretely, the DP recursion needs a linear or tree-shaped order, and for a DAG the nodes can be linearized in topological order: for the current node v, look backwards in the topological sequence for nodes that can reach v, and recur with dist[v] = min{ dist[u] + w[u][v] | u can reach v }.

A DAG-based scheduling use case. DAGs have a very typical application in scheduling systems with mutual dependencies; take Spark as the example. Every operation in Spark produces an RDD, RDDs are joined by edges, and the RDDs together with their edges form a DAG. Spark's RDD dependencies come in two kinds, narrow dependencies and wide dependencies (shuffle dependencies in the source code), and the dependency information serves two purposes: efficient fault recovery, and dividing the DAG into stages. A sequence of transformations on the original RDDs yields the DAG; with a computable DAG in hand, the Spark kernel's next task is to split the computation into task sets, the stages, so that tasks can be submitted to compute nodes for the real computation. Spark keeps intermediate results in memory by default, and when dividing stages it looks for the parts of the distributed computation that can be pipelined, to raise efficiency; in that process it splits the DAG into stages according to the kind of dependency between RDDs. Under a narrow dependency, the partition transformation completes within a single stage; under a wide dependency, the shuffle means the next computation can only start after the parent RDD has been fully processed, so wide dependencies are the basis for stage boundaries. Spark's execution flow: 1) user code defines the DAG of RDDs; operations on RDDs create new RDDs that refer to their parents, building the graph. 2) An action forces the DAG to be translated into an execution plan; when an action is invoked on an RDD, that RDD, and hence its parents, must be computed, so the scheduler submits a job to compute all necessary RDDs. The job comprises one or more steps, each a wave of parallel tasks; a step corresponds to one or more RDDs of the DAG when pipelining occurs. 3) Tasks are scheduled and executed on the cluster; steps run in order, tasks start independently to compute their parts of an RDD, and when the job's final step finishes, the action is complete.

A blockchain use case. The DAG concept first entered blockchain consensus in 2013, when an Israeli scholar at the Hebrew University posting as avivz78 on bitcointalk.org proposed the GHOST protocol as a scaling solution for Bitcoin's transaction throughput; Casper, the PoS consensus protocol Vitalik described in Ethereum's purple paper, is a PoS variant based on the GHOST PoW protocol. Later, people proposed using a DAG topology to store blocks and so attack blockchain's efficiency problem: a blockchain has a single chain, so block packing cannot proceed concurrently, but turning the chained storage into a DAG mesh permits concurrent writes; with the packing time per block unchanged, the network can pack N blocks in parallel and throughput rises N-fold. At that point the DAG-blockchain combination was still a side-chain-like idea, transaction packing proceeding in parallel on different branch chains, and the DAG was still at block granularity. In September 2015 Sergio Demian Lerner published "DagCoin: a cryptocurrency without blocks," proposing the DAG-chain and for the first time lifting the DAG network from the coarse granularity of block packing to the level of individual transactions. In DagCoin's approach, every transaction directly takes part in maintaining the network-wide transaction order: once issued, a transaction is broadcast to the whole network, skipping the time to pack transactions into blocks, so confirmation follows broadcast directly and efficiency takes, in theory, a qualitative leap. The DAG thus evolved further, into a solution that abandons blocks altogether. In July 2016 IOTA burst onto the scene, followed by ByteBall; they were the first true technical realizations of a DAG network, and the prototype of the self-styled blockless DAG-chain family basically took shape. In some sense the DAG may be the next generation of blockchain: from a single chain to tree and mesh, from block granularity down to transaction granularity, from single-point writes to concurrent writes, a revolution in both capacity and speed.

A one-sentence summary

The directed acyclic graph has many scientific and computational applications, from biology to sociology to our familiar computing field; its principle is not complicated, and the key is to abstract the target problem into a DAG and then use the DAG's properties to solve it.
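To make the Merkle DAG storage use case above concrete, here is a minimal sketch of its core invariant, that a parent's hash is the hash of its children's hashes. Plain Python; the node layout is invented for illustration and is not IPFS's actual object format.

import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class Node:
    def __init__(self, payload: bytes, children=()):
        self.children = list(children)
        # The parent commits to its children: hashing the payload together
        # with the children's hashes means any change below alters every
        # hash above it, up to the root.
        self.id = h(payload + "".join(c.id for c in self.children).encode())

leaf_a = Node(b"chunk A")
leaf_b = Node(b"chunk B")
root = Node(b"file manifest", [leaf_a, leaf_b])

tampered = Node(b"chunk A!")  # one byte differs
root2 = Node(b"file manifest", [tampered, leaf_b])
assert root.id != root2.id    # the root hash exposes the change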

References and further reading:
https://www.lwlwq.com/post-dag.html
https://zhuanlan.zhihu.com/p/275992858
https://www.jianshu.com/p/84cdab3030cb
https://en.wikipedia.org/wiki/Directed_acyclic_graph
https://www.cnblogs.com/gtarcoder/p/4900516.html
https://www.geeksforgeeks.org/shortest-path-for-directed-acyclic-graphs/


From Blockchain to DAG (2): The Basic Structure of the DAG - Zhihu

From Blockchain to DAG (2): The Basic Structure of the DAG

By 区块链新观

DAG (Directed Acyclic Graph) has been brought forward again as a new underlying ledger structure. "Directed" means the edges have a direction; "acyclic" means there are no cycles. In the first article we distilled two keywords from our discussion of the blockchain: ledger structure and consensus mechanism. This article focuses on the structural relationship between the DAG and the blockchain; later articles will introduce, in turn, the consensus mechanisms matching this structure.

1. What a DAG looks like

We are all quite familiar with the Merkle tree as a data structure; the inside of every Bitcoin block uses a Merkle tree to record transaction information (Figure 1, block structure; source: Zou Jun, Zhang Haining, Tang Yi, Li Lei, 《区块链技术指南》, China Machine Press). As the figure shows, the Merkle tree is a kind of directed tree: every vertex in the tree may point to only one earlier vertex, and the whole body of data has an evident direction of flow. The DAG structure instead allows each vertex to point to several earlier vertices, while the data flow still has an evident direction. A third structure is the directed graph, which, unlike the DAG, allows data to flow back, so the overall data flow is not so evident. Figure 2 illustrates the differences among the directed tree, the DAG, and the directed graph (source: 有向无环图, Baidu Baike).

2. The blockchain is a special DAG structure

With an intuitive picture of the DAG, let us sort out why the blockchain can be considered a special DAG. As explained in "From Blockchain to DAG (1)", whether a blockchain forks depends on block-production speed and broadcast speed: when blocks are produced faster than they are broadcast, several blocks are being broadcast at once and forks appear, and the more forks, the worse the security. To reduce forks, the balance Bitcoin found between performance and security is one block every ten minutes. Now suppose each block's production time were long enough that no new block is ever mined while the previous block is still being broadcast; the blockchain's structure would then be a single chain (Figure 3, the single chain; source: From Blockchains to BlockDAGs, https://www.youtube.com/watch?v=tk38AAV_whw). In reality, network latency and similar causes make forks unavoidable, so the actual blockchain structure looks like Figure 4: ledger consensus selects just one valid main chain (yellow), and the transactions in the remaining blocks (red) are invalid and not adopted. Setting ledger consensus aside for the moment, that is, not yet considering how the valid main chain is selected, a typical blockchain structure at the network level looks like Figure 5 and a typical DAG structure like Figure 6 (source for both: From Blockchains to BlockDAGs). The only difference between the two structures is that a DAG block may point to several earlier blocks, while a blockchain block can point only to a unique earlier block. Concretely, a blockchain block header can contain only one block hash, pointing to the unique parent block, whereas under the DAG structure a block header can contain several block hashes, pointing to different predecessor blocks (Figure 7; source: DAG的妙用(三)——比特币协议的扩展, https://mp.weixin.qq.com/s/KeQDVQMJLQlyXQcywfE8sQ).

Now introduce a fork coefficient k, the number of forks the network can tolerate. When k = 0 the network allows no forks, as in Figure 3; such a non-forking network is a blockchain. Bitcoin, from the previous article, fits this definition; and although Ethereum has uncle-block forks, uncles serve merely as counters for judging main-chain weight and are never appended to the main chain (the transactions recorded in uncle blocks do not enter the main chain), so Ethereum fits the definition too. A DAG network's k must be an integer greater than 0. Structurally, then, the DAG is an extension of the blockchain structure, and the blockchain is a special, simplified DAG.

3. How the DAG structure affects performance

Simply put, we can view the DAG as a network structure that permits forks, the number permitted being set by the fork coefficient k. What does a fork-permitting network really mean? It means block production may outpace broadcasting. On the one hand, more transactions are packed per unit of time; on the other, while a block A is being broadcast across the network, a forked block B is being broadcast too, and in the end some nodes confirm only A while others confirm only B, so the DAG allows the network's nodes to record different information at the same moment. Together these give the DAG its high-concurrency, weakly synchronized character. The DAG is asynchronous bookkeeping, a mode that can greatly raise the speed at which the network processes information, the TPS. The blockchain, by contrast, is a strongly synchronized bookkeeping network, requiring every node to record the same information at the same time, and that requirement caps the network's information-processing capacity, keeping TPS rather low.

The question then becomes: what consensus should this fork-permitting, asynchronous bookkeeping network use? The previous article split consensus into block-production consensus and ledger consensus. A DAG's block-production consensus can be the same as a blockchain's, say PoW, and because forks are allowed the block interval can be set very short. It can also differ: the IOTA project simply cancels the packing of blocks, writing each transaction into the network the moment it occurs (each square in its DAG diagram is not a block but a single transaction), thereby gaining extremely fast transaction processing. A DAG's ledger consensus is far more complex than a blockchain's: how to deter malicious nodes, how to choose between two mutually contradictory transactions, how to prevent double spending? As the underlying network structure grows more complex, higher demands fall on ledger consensus. Later articles will introduce this in detail.

References:
[1] Zou Jun, Zhang Haining, Tang Yi, Li Lei, 《区块链技术指南》, China Machine Press
[2] From Blockchains to BlockDAGs, https://www.youtube.com/watch?v=tk38AAV_whw
[3] DAG的妙用(三)——比特币协议的扩展, https://mp.weixin.qq.com/s/KeQDVQMJLQlyXQcywfE8sQ
[4] Jeff Zhou: DAG高速异步区块链技术, https://www.jianshu.com/p/45d73e0e74ec

Published 2021-10-28.
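A minimal sketch of the structural difference described in section 2: a chain block header carries exactly one parent hash, while a DAG block header may carry several. Toy Python, not any real protocol's header format.

import hashlib

def block_id(payload, parent_ids):
    return hashlib.sha256((payload + "|".join(sorted(parent_ids))).encode()).hexdigest()

# Chain (fork coefficient k = 0): every block references exactly one parent.
g = block_id("genesis", [])
b1 = block_id("block 1", [g])
b2 = block_id("block 2", [b1])

# DAG (k > 0): a block may reference several parents, merging branches
# that were produced in parallel while earlier blocks were still propagating.
a = block_id("branch A", [g])
b = block_id("branch B", [g])
m = block_id("merge", [a, b])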

How should one understand the DAG data structure and explain it in plain language? - Zhihu

How should one understand the DAG data structure and explain it in plain language?

Question: If we look at the DAG in isolation, it is just a DAG; put next to Hashgraph, it becomes a peculiar structure. From the technical angle it is technology; from the operational angle it serves real needs…

Answer 1 (小羊开门呐是我):

DAG: Directed Acyclic Graph. It is a way of storing data: "directed" means all the data is stored along one direction; "acyclic" means the data structures never form a loop. It is like a scarf knitted from yarn: you can keep knitting indefinitely.

Why the lines and arrows? Because every user who issues a transaction must validate two earlier transactions. It is rather like an orphan choosing its own adoptive parents. The DAG is a world of orphans: every transaction is born an orphan, but the adoptive parents cannot be chosen carelessly; they must be of good standing, or the orphan itself will never be chosen as a parent by the next generation, and not being chosen means disappearing. A transaction that is never validated by later transactions becomes a true orphan and loses its legitimacy in the ledger.

The DAG is a data storage structure that has been in use throughout the thirty-plus years since it was devised; nothing is wrong with it as such. Its difference from a blockchain is that a DAG has no consensus in the traditional sense: how trustworthy a transaction is depends on how many people trust it. So the core problem for systems adopting DAG technology is how to protect network-wide agreement.

Answer 2 (乐振声):

Not being a technologist, I can only help you explain it to users from a philosophical angle. Dialectical materialism holds that the world is interconnected. Under traditional internet technology, data and images are not fully interlinked; it is simulation, running by preset rules (a fish swims away when it meets an obstacle, but only so many meters, then returns to its initial state), so real actions cannot affect the virtual environment, because the system sets a return value by fixed criteria; even adding self-learning AI keeps the variation inside a fixed data domain. Blockchain data is mutually unified and interconnected. DAG blockchain technology, like Nüwa molding clay into human form, shapes chaotic data into concrete forms and gains control over the data that composes real things, but that control is one-way and irreversible. Technology gives them life, yet the composed things are still influenced by other things, which we may call their "fate." Only when a thing and its surroundings acquire such a fate will the virtual world respond to our perturbations and evolve toward real history. Most such experiments are chaotic and disordered, but mutations occur, and a mutated thing will develop along a main chain: that is history. (If you find this explanation unhelpful, feel free to collapse it.)
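A toy version of the "orphan chooses its parents" rule from the first answer: a new transaction must approve existing tips, the transactions nothing has approved yet. This is an invented model, not IOTA's actual tip-selection algorithm.

txs = {"g": []}  # tx id -> ids of the earlier txs it approves

def tips():
    approved = {p for parents in txs.values() for p in parents}
    return [t for t in txs if t not in approved]

def issue(tx_id, parents):
    assert all(p in txs for p in parents), "may only approve existing transactions"
    txs[tx_id] = parents

issue("a", ["g"])       # two transactions arrive in parallel,
issue("b", ["g"])       # both approving the genesis tip
issue("c", ["a", "b"])  # a newcomer approves two tips at once
print(tips())           # ['c']: only c still awaits approval by future transactions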


Algorithms in Depth: the DAG, the Directed Acyclic Graph

By 一面千人 (cnblogs)

The DAG is widely regarded as the hallmark of the next generation of blockchains. This article studies the DAG from its algorithmic foundations and looks at how it is applied to blockchains and which of today's blockchain problems it solves.

Keywords: DAG, directed acyclic graph, algorithm, bag, depth-first search, stack, BlockChain, blockchain

The graph is the most complex of the data structures. When I was in college, the chapter on graphs was excluded from the exam and left to our after-class curiosity, yet in the information society graphs are applied extremely widely. The main kinds of graph are:

Undirected graph: simple connections between nodes
Directed graph: connections carry direction
Weighted graph: connections carry weights
Weighted directed graph: connections carry both direction and weights

A graph consists of a set of vertices and a set of edges, each able to join two vertices.

Familiar maps, circuits, and networks are all graph structures.

Terminology

Vertex: a point in the graph
Edge: the line segment joining two vertices
Adjacent: the vertices at the two ends of an edge are adjacent
Degree: the number of edges leaving a vertex is that vertex's degree
Path: the ordered set of vertices passed through, joined by edges, from one vertex to another
Simple path: a path with no repeated vertices
Cycle: a path with at least one edge whose start and end are the same vertex
Simple cycle: a cycle with no repeated vertices or edges
Connected: two vertices are connected when one can reach the other through at least one edge
Connected graph: a graph in which, from any vertex, a path exists to reach any other vertex
Acyclic graph: a graph that contains no cycles
Sparse graph: every vertex's degree is low; the graph looks sparse
Dense graph: every vertex's degree is high; the graph looks dense
Bipartite graph: a graph whose vertices can all be divided into two parts

So a tree is simply an acyclic connected graph.

Directed graphs

A directed graph is a graph with directionality, composed of a set of vertices and a set of directed edges. In plain language, a directed graph includes arrows to represent direction.

Food chains and network communication are familiar directed-graph structures.

Terminology

We introduced a vertex's degree above; in a directed graph it is refined into:

Out-degree: the total number of edges leaving a vertex
In-degree: the total number of edges pointing at a vertex

Next, because of a directed graph's directionality, an edge's starting vertex is called its head and the vertex it points to its tail.

Directed path: a set of vertices in the graph such that from any one of them there is a directed edge pointing to another vertex of the set.

Directed cycle: a directed path, with at least one edge, whose start and end are the same vertex.

Simple directed cycle: a cycle containing no repeated vertices or edges.

The length of a path or cycle is the number of edges it contains.

In a directed graph, connectivity appears as reachability: because edges have direction, reachability requires following the edges leaving a vertex in their correct direction to connect with another vertex.

Adjacency-list array

A graph must be a representable data type: from concrete file contents we should be able to present all the vertices of a graph and the edges between them.

An adjacency-list array is indexed by vertex (vertices have no weight, only a sequence, so indexing starts from 0), and each element is the list of vertices adjacent to that vertex.

5 vertices, 4 edges
0: 4 1
1: 0
2: 3
3:
4:

The bag

We build a bag collection to store the set of vertices connected to a given vertex; since storage order does not matter and items only go in and never come out, the bag structure is the right fit. A refresher on the bag:

package algorithms.bag;

import java.util.Iterator;

// A bag collection: supports generics and iteration.
public class Bag<Item> implements Iterable<Item> {

    private class BagNode {
        Item item;
        BagNode next;
    }

    BagNode head;
    int size;

    @Override
    public Iterator<Item> iterator() {
        return new Iterator<Item>() {
            BagNode node = head;

            @Override
            public boolean hasNext() {
                return node.next != null; // the list ends with a sentinel node, so stop when next is null
            }

            @Override
            public Item next() {
                Item item = node.item;
                node = node.next;
                return item;
            }
        };
    }

    public Bag() {
        head = new BagNode();
        size = 0;
    }

    // Insert at the front.
    public void add(Item item) {
        BagNode temp = new BagNode();
        // The two lines below are required; writing `temp = head` would only copy
        // the reference, so every later change to head would show through temp,
        // temp could not serve as a snapshot, and a bug would follow.
        temp.next = head.next;
        temp.item = head.item;
        head.item = item;
        head.next = temp;
        size++;
    }

    public boolean isEmpty() {
        return size == 0;
    }

    public int size() {
        return this.size;
    }

    public static void main(String[] args) {
        Bag<String> bags = new Bag<>();
        bags.add("hello");
        bags.add("yeah");
        bags.add("liu wen bin");
        bags.add("seminar");
        bags.add("1243");
        System.out.println(bags.size);
        // Because Bag implements Iterable, it can be traversed like this:
        for (String a : bags) {
            System.out.println(a);
        }
    }
}

The digraph structure

The code below implements a directed-graph data type, with the usual digraph attributes and operations.

package algorithms.graph;

import algorithms.bag.Bag;
import ioutil.In;
import ioutil.StdOut;

public class Digraph {
    private final int V; // total number of vertices; final, set once at construction
    private int E;       // total number of edges
    private Bag<Integer>[] adj; // adjacency lists: index = vertex, value = the vertices it points to

    public Digraph(int v) {
        this.V = v;
        this.E = 0;
        adj = (Bag<Integer>[]) new Bag[V];
        for (int i = 0; i < V; i++) {
            adj[i] = new Bag<>();
        }
    }

    public Digraph(In in) {
        this(in.readInt());
        int E = in.readInt();
        for (int i = 0; i < E; i++) {
            int v = in.readInt();
            int w = in.readInt();
            addEdge(v, w);
        }
    }

    public int V() {
        return this.V;
    }

    public int E() {
        return this.E;
    }

    /**
     * Add an edge between vertices v and w, increasing density.
     *
     * @param v a vertex value (V is the vertex count; v itself is unbounded here)
     * @param w likewise
     */
    public void addEdge(int v, int w) {
        adj[v].add(w);
        E++;
    }

    /**
     * Return an iterator over the vertices adjacent to v.
     *
     * @return Bag is itself iterable, so the vertex's Bag of neighbors is returned directly.
     */
    public Iterable<Integer> adj(int v) {
        return adj[v];
    }

    /**
     * Reverse every edge in the graph.
     *
     * @return a copy of the graph with all edge directions reversed
     */
    public Digraph reverse() {
        Digraph R = new Digraph(V);
        for (int v = 0; v < V; v++) {
            for (int w : adj[v]) { // each vertex w the original graph connects v to
                R.addEdge(w, v);
            }
        }
        return R;
    }

    /**
     * Print the digraph in adjacency-list-array form.
     */
    public String toString() {
        String s = V + " vertices, " + E + " edges\n";
        for (int v = 0; v < V; v++) {
            s += v + ": ";
            for (int w : this.adj(v)) {
                s += w + " ";
            }
            s += "\n";
        }
        return s;
    }

    public static void main(String[] args) {
        Digraph d = new Digraph(5);
        d.addEdge(0, 1);
        d.addEdge(1, 0);
        d.addEdge(2, 3);
        d.addEdge(0, 4);
        StdOut.println(d);
        /**
         Output:
         5 vertices, 4 edges
         0: 4 1
         1: 0
         2: 3
         3:
         4:
         */
    }
}

For details of the bag and digraph code above, see the inline comments.

Reachability

We mentioned above the relation between reachability in a digraph and connectivity in a graph: reachability is the direction-sensitive special form of connectivity, so no study of digraphs can skip it.

Reachability answers questions of the form: from one vertex v to another vertex w, does a directed path exist?

Depth-first search

Answering reachability questions relies on the depth-first search algorithm. To understand it better, first work out how to explore a maze completely.

Trémaux search

The rule for exploring a maze completely is: start from the entrance, take no passage twice, and walk until you exit the maze. The procedure:

Whenever you reach a new vertex or edge for the first time, mark it.
Whenever, while walking, you meet an already-marked vertex or edge, retreat to the previous vertex.
When the vertex you have retreated to has no passable edges left, keep retreating.

I think Trémaux search offers some inspiration; back to the graph's depth-first search algorithm.

package algorithms.graph;

import algorithms.bag.Bag;
import ioutil.StdOut;

/**
 * Answers digraph vertex-reachability questions using depth-first search (DFS).
 */
public class DigraphDFS {
    private boolean[] marked; // has the vertex been marked (reached)?

    /**
     * Find all vertices reachable from a single start vertex.
     */
    public DigraphDFS(Digraph digraph, int start) {
        marked = new boolean[digraph.V()]; // initialize the marked array
        dfs(digraph, start);
    }

    /**
     * Find all vertices reachable from a set of start vertices, passed as a collection.
     */
    public DigraphDFS(Digraph digraph, Iterable<Integer> startSet) {
        marked = new boolean[digraph.V()];
        for (int w : startSet) {
            dfs(digraph, w);
        }
    }

    /**
     * Is the vertex marked? Marked means reached, hence reachable.
     */
    public boolean marked(int v) {
        return marked[v];
    }

    /**
     * Core DFS: starting from v, mark out a valid path through the graph.
     * What is returned, via the marks, is a valid path.
     */
    private void dfs(Digraph digraph, int v) {
        marked[v] = true; // the start vertex is reachable
        for (int w : digraph.adj(v)) { // each vertex directly reachable from v
            if (!marked[w]) dfs(digraph, w); // if w is unvisited, continue the DFS from w (a step forward)
        }
    }

    public static void main(String[] args) {
        Digraph d = new Digraph(5); // a digraph with five vertices
        d.addEdge(0, 1);
        d.addEdge(1, 0);
        d.addEdge(2, 3);
        d.addEdge(0, 4);
        Bag<Integer> startSet = new Bag<>();
        startSet.add(2);
        DigraphDFS reachable = new DigraphDFS(d, startSet);
        for (int v = 0; v < d.V(); v++) {
            if (reachable.marked(v)) {
                StdOut.print(v + " ");
            }
        }
        StdOut.println();
        /**
         * Output: 2 3
         */
    }
}

startSet is the input argument, holding only the value 2: we search the graph for 2's valid paths. From the graph's edges we can see that 2's only valid path leads to 3, so the output is correct.

An application of reachability: garbage collection

We all know that ordinary object garbage collection counts an object's references. In the graph model, objects are vertices and references are edges; when an object goes unreferenced for a while, its vertex becomes isolated, unreachable from the vertices on any valid path, and is therefore never marked. A runtime such as the JVM effectively keeps running a program like the DFS above, continually finding the unmarked vertices and, on a schedule, clearing those objects to free memory.
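A mark-and-sweep sketch of that idea (in Python for brevity, though this article's code is Java): objects are vertices, references are edges, and anything not reachable from the roots is garbage. The heap layout is invented for illustration.

heap = {"root": ["a", "b"], "a": ["c"], "b": [], "c": ["a"], "d": ["e"], "e": []}

def mark(roots):
    marked, stack = set(), list(roots)
    while stack:
        v = stack.pop()
        if v not in marked:
            marked.add(v)
            stack.extend(heap[v])  # follow outgoing references
    return marked

live = mark(["root"])
garbage = set(heap) - live
print(garbage)  # {'d', 'e'}: unreachable and reclaimable, even though d references e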

Directed acyclic graphs

A directed graph containing no directed cycles is a directed acyclic graph: DAG, Directed Acyclic Graph.

We have worked step by step through graphs and digraphs; this section introduces the directed acyclic graph, whose definition was just given. The DAG is a special structure among digraphs, so the first question is:

How to verify that a directed graph contains no directed cycle, that is, how to certify a DAG.

Finding directed cycles

Given that question, we build a directed-cycle finder, again relying on the DFS depth-first search algorithm; if no cycle is found, the digraph is a DAG.

First let me fill a hole: the bag above, and the stack below, appeared in my very first post, written before I had a decent editor and without pasted code, so it is worth reproducing the stack properly here.

package algorithms.stack;

import ioutil.StdOut;

import java.util.Iterator;
import java.util.NoSuchElementException;

public class Stack<Item> implements Iterable<Item> {
    private int SIZE;
    private Node first; // top of the stack

    public Stack() { // initialize the fields
        SIZE = 0;
        first = null;
    }

    private class Node {
        private Item item;
        private Node next;
    }

    // Push: insert a new element at `first`.
    public void push(Item item) {
        Node temp = first;
        first = new Node();
        first.item = item;
        first.next = temp;
        SIZE++;
    }

    // Pop: take the element at `first`; LIFO, last in first out.
    public Item pop() {
        if (isEmpty()) throw new RuntimeException("Stack underflow");
        Item item = first.item;
        first = first.next;
        SIZE--;
        return item;
    }

    public boolean isEmpty() {
        return first == null;
    }

    public int size() {
        return this.SIZE;
    }

    @Override
    public Iterator<Item> iterator() {
        return new Iterator<Item>() {
            Node node = first;

            @Override
            public boolean hasNext() {
                return node != null; // must test node, not first, or iteration would never end
            }

            @Override
            public Item next() {
                if (!hasNext()) throw new NoSuchElementException();
                Item item = node.item;
                node = node.next;
                return item;
            }
        };
    }

    public static void main(String[] args) {
        Stack<String> stack = new Stack<>();
        stack.push("heyheyhey");
        stack.push("howau");
        stack.push("231");
        StdOut.println(stack.SIZE);
        StdOut.println(stack.pop());
    }
}

The cycle-finding program relies on the stack structure, hence the refresher above; now back to the cycle finder. (You could equally use the java.util.Stack class.)

package algorithms.graph;

import ioutil.StdOut;

import java.util.Stack;

public class DirectedCycle {
    private boolean[] marked;     // indexed by vertex: has the vertex been marked (reached)?
    private Stack<Integer> cycle; // stores the vertices of a directed cycle, if one is found
    // **** the key idea: start ****
    private int[] edgeTo;         // edgeTo[0] = 1 means the edge 1 -> 0: vertex 0 was reached from 1
    // **** the key idea: end ****
    private boolean[] onStack;    // indexed by vertex: is the vertex on the current DFS recursion stack?

    public DirectedCycle(Digraph digraph) {
        // initialize the fields
        marked = new boolean[digraph.V()];
        onStack = new boolean[digraph.V()];
        edgeTo = new int[digraph.V()];
        cycle = null;
        // check for a cycle from every vertex
        for (int v = 0; v < digraph.V(); v++) {
            if (!marked[v]) dfs(digraph, v);
        }
    }

    private void dfs(Digraph digraph, int v) {
        onStack[v] = true; // recursion begins: v goes on the stack
        marked[v] = true;
        for (int w : digraph.adj(v)) { // examine each edge v -> w
            // stop early once a cycle has been found
            if (hasCycle()) return;
            // onStack records the vertices on the current valid path; if w is on
            // the stack, w served as a starting point earlier in this recursion.
            if (!marked[w]) {
                edgeTo[w] = v; // w was reached from v
                dfs(digraph, w);
            } else if (onStack[w]) {
                // We hit a marked vertex that is still on the recursion stack: the
                // stack holds starting points, and reaching a marked one means the
                // path has returned to its own starting point, a directed cycle.
                cycle = new Stack<>();
                for (int x = v; x != w; x = edgeTo[x]) { // w itself is pushed last; do not push it twice
                    cycle.push(x); // walk back from v, pushing the cycle's interior vertices
                }
                cycle.push(w); // push the cycle's start/end vertex
            }
        }
        onStack[v] = false; // the recursion unwinds: v comes off the stack
    }

    public boolean hasCycle() {
        return cycle != null;
    }

    public Iterable<Integer> cycle() {
        return cycle;
    }

    public static void main(String[] args) {
        Digraph d = new Digraph(6);
        d.addEdge(0, 1);
        d.addEdge(1, 2);
        d.addEdge(2, 3);
        d.addEdge(3, 0);
        DirectedCycle directedCycle = new DirectedCycle(d);
        if (directedCycle.hasCycle()) {
            for (int a : directedCycle.cycle()) {
                StdOut.println(a);
            }
        } else {
            StdOut.println("DAG");
        }
    }
}

The code is not long, but the algorithm is quite intricate; I have tried to explain it in detail in the comments, and any questions are welcome in the comments.

The test graph for the program above is:

6 vertices, 4 edges
0: 1
1: 2
2: 3
3: 0
4:
5:

By inspection, this contains the directed cycle 0-1-2-3-0, so the program's output is:

3
2
1
0

What was pushed first comes out last, so this reads as the directed cycle 0-1-2-3. If we change the graph's contents to:

6 vertices, 4 edges
0: 1
1: 2
2: 3
3:
4:
5: 0

then the last piece of the puzzle, 3→0, has been broken, replaced by an inconsequential 5→0; the digraph now contains no directed cycle, and the program prints:

DAG

DAG and blockchain

The chapters above dug deep into the DAG; by now you understand its algorithmic side as well as I do. What, then, does it have to do with today's much-discussed blockchain? This chapter covers that.

In earlier articles we saw that blockchains, Bitcoin and Ethereum alike, are built on a chain structure and realize a decentralized, peer-to-peer, trustless technology. But that chain keeps meeting new challenges as the business grows: block storage capacity, transaction speed, total data volume, storage pressure on single nodes, and so on. The DAG is a graph-based realization. The reason directed cycles are forbidden is that the DAG must guarantee the ordering of the nodes' transactions, and the main chain can be found through the valid paths introduced above; if a directed cycle appeared, the system would fall into disorder. Without directed cycles, many valid paths can join the DAG's vertices, so the DAG can fairly be called a more complete, more powerful new-generation blockchain structure.

Well-known blockchain products adopting DAG technology include DagCoin, IOTA, and ByteBall, all built on the DAG with across-the-board gains in performance and capacity.

Forks are still possible here, and the handling is the same: see which node gains new successors, as we discussed when covering uncle blocks.

After adopting the DAG structure, a blockchain becomes blockless: we no longer pack transactions into blocks and store them block by block, but store the transaction itself as the basic unit. The DAG can also suffer double spending, again caused by the forking just mentioned, but it recovers automatically once the valid path is confirmed. DAG consensus is asynchronous; I do not yet understand the exact mechanism, but it resolves the transaction-performance problem.

Summary

This article moved step by step from graphs to digraphs to directed acyclic graphs, introduced the relevant terminology and API implementations, supplemented the bag and stack code, and focused on depth-first search and the directed-cycle-finding algorithm. It closed with a brief account of the DAG's relationship to blockchain. As the technology develops, the DAG may hope to become the true blockchain 3.0.

References

Algorithms, 4th Edition; assorted online material.


Posted 2018-03-14 by 一面千人.

Directed Acyclic Graph - OI Wiki

Definition

The edges are directed and there is no cycle. The English name is Directed Acyclic Graph, abbreviated DAG.

Properties

A graph that can be topologically sorted must be a directed acyclic graph: if there were a cycle, then for any two nodes on the cycle no linear sequence could satisfy the ordering condition. Conversely, a directed acyclic graph can always be topologically sorted: by induction, assume every directed acyclic graph with at most n nodes can be topologically sorted; for one with n+1 nodes, consider the situation after the first step of the topological sort has been executed.

Judgement

How do we decide whether a directed graph is acyclic? Test whether it admits a topological sort. There is also another method: run one DFS over the graph and check the resulting DFS tree for non-tree edges pointing back to an ancestor (back edges); if any exists, the graph has a cycle.

Applications

DP for longest (shortest) paths. On a general graph, the best time complexity for single-source longest (shortest) paths is O(nm) (the Bellman-Ford algorithm, usable on graphs with negative weights) or O(m log m) (the Dijkstra algorithm, for graphs without negative weights). On a DAG, however, we can use DP to compute longest (shortest) paths and improve the time complexity to O(n + m). The state transition is dist[v] = max(dist[u] + w(u, v)) over edges (u, v), or min for shortest paths. After a topological sort, traverse the nodes in topological order, using the current node to update those after it:

struct edge {
    int v, w;
};

int n, m;
vector<edge> e[MAXN];
vector<int> L;                               // stores the topological order
int max_dis[MAXN], min_dis[MAXN], in[MAXN];  // in[] holds each node's in-degree

void toposort() {  // topological sort
    queue<int> S;
    memset(in, 0, sizeof(in));
    for (int i = 1; i <= n; i++) {
        for (int j = 0; j < e[i].size(); j++) {
            in[e[i][j].v]++;
        }
    }
    for (int i = 1; i <= n; i++)
        if (in[i] == 0) S.push(i);
    while (!S.empty()) {
        int u = S.front();
        S.pop();
        L.push_back(u);
        for (int i = 0; i < e[u].size(); i++) {
            if (--in[e[u][i].v] == 0) {
                S.push(e[u][i].v);
            }
        }
    }
}

void dp(int s) {  // single-source longest (shortest) paths from s
    toposort();   // topological sort first
    memset(min_dis, 0x3f, sizeof(min_dis));
    memset(max_dis, 0, sizeof(max_dis));
    min_dis[s] = 0;
    for (int i = 0; i < L.size(); i++) {
        int u = L[i];
        for (int j = 0; j < e[u].size(); j++) {
            min_dis[e[u][j].v] = min(min_dis[e[u][j].v], min_dis[u] + e[u][j].w);
            max_dis[e[u][j].v] = max(max_dis[e[u][j].v], max_dis[u] + e[u][j].w);
        }
    }
}

See also: DP on DAGs.

Why do so many distributed systems implement computation as a DAG (Directed Acyclic Graph)? - Zhihu

Why do so many distributed systems implement computation as a DAG (Directed Acyclic Graph)?

Question: Tez, for example, or Spark, plus a heap of frameworks and systems out of research labs. The only thing I can set beside the DAG is Flink's stream computation…

Answer 1 (木鸟杂记, who writes on large-scale data systems):

Because the graph is the natural data structure for expressing relations, and the dependencies between execution units (usually induced through their inputs and outputs) are exactly such a relation. Then why "directed acyclic"? If the graph were any more complex and contained a cycle, it could not be topologically sorted and therefore could not be scheduled. Just imagine: if the execution logic is a cycle, where do you start executing? You could, of course, start somewhere and execute round after round iteratively, but that essentially splits the computation into iteration rounds and converts it back into an acyclic graph. If the graph were any simpler, it would degenerate into a tree, like the plan tree in a SQL engine. In a distributed system, however, a single logical dataset (a table, an RDD) is usually spread across machines as data partitions, so at the shuffle operators that span machines (sort, groupBy, join) the execution plan that is logically a tree becomes, in its physical computation topology, a graph: one upstream is depended on by several downstreams, and one downstream is produced by several upstreams.

Answer 2 (a Zhihu user):

It is not that distributed computation must be implemented as a DAG; rather, data-processing flows are naturally like this, and once we abstract them we find they can always be described as a DAG. In the early days when there was only MapReduce, Hive decomposed a computation task into several cascaded Map-Reduce rounds; later everyone found that model too inefficient, removed the useless branches, and ended up with a DAG again, which is why MapReduce is also called a special form of the DAG. As for streaming, it is not a concept in the same domain as the DAG: streaming describes more how the data source is fed in, batch model versus stream, while the real computation logic behind it is still a DAG. If you don't believe it, look at Spark Streaming.
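The "no topological sort, no schedule" point in the first answer is easy to demonstrate: a scheduler repeatedly launches every task whose dependencies have all finished, which is exactly Kahn's algorithm. A minimal sketch in Python; the task names are invented, and real engines add data movement, retries, and parallel execution.

from collections import deque

deps = {                 # task -> tasks it depends on
    "extract": [],
    "clean":   ["extract"],
    "lookup":  [],
    "join":    ["clean", "lookup"],
    "report":  ["join"],
}

indeg = {t: len(d) for t, d in deps.items()}
dependents = {t: [] for t in deps}
for t, ds in deps.items():
    for d in ds:
        dependents[d].append(t)

ready = deque(t for t, n in indeg.items() if n == 0)
order = []
while ready:             # each wave of ready tasks could run in parallel
    t = ready.popleft()
    order.append(t)      # "run" the task
    for nxt in dependents[t]:
        indeg[nxt] -= 1
        if indeg[nxt] == 0:
            ready.append(nxt)

assert len(order) == len(deps), "cycle: the workflow cannot be scheduled"
print(order)             # e.g. ['extract', 'lookup', 'clean', 'join', 'report']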

DAGs — Airflow Documentation

DAGs

A DAG (Directed Acyclic Graph) is the core concept of Airflow, collecting Tasks together, organized with dependencies and relationships to say how they should run.

Here’s a basic example DAG:

It defines four Tasks - A, B, C, and D - and dictates the order in which they have to run, and which tasks depend on what others. It will also say how often to run the DAG - maybe “every 5 minutes starting tomorrow”, or “every day since January 1st, 2020”.

The DAG itself doesn’t care about what is happening inside the tasks; it is merely concerned with how to execute them - the order to run them in, how many times to retry them, if they have timeouts, and so on.

Declaring a DAG

There are three ways to declare a DAG - either you can use a context manager, which will add the DAG to anything inside it implicitly:

import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="my_dag_name",
    start_date=datetime.datetime(2021, 1, 1),
    schedule="@daily",
):
    EmptyOperator(task_id="task")

Or, you can use a standard constructor, passing the DAG into any operators you use:

import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

my_dag = DAG(
    dag_id="my_dag_name",
    start_date=datetime.datetime(2021, 1, 1),
    schedule="@daily",
)
EmptyOperator(task_id="task", dag=my_dag)

Or, you can use the @dag decorator to turn a function into a DAG generator:

import datetime

from airflow.decorators import dag
from airflow.operators.empty import EmptyOperator

@dag(start_date=datetime.datetime(2021, 1, 1), schedule="@daily")
def generate_dag():
    EmptyOperator(task_id="task")

generate_dag()

DAGs are nothing without Tasks to run, and those will usually come in the form of either Operators, Sensors or TaskFlow.

Task Dependencies

A Task/Operator does not usually live alone; it has dependencies on other tasks (those upstream of it), and other tasks depend on it (those downstream of it). Declaring these dependencies between tasks is what makes up the DAG structure (the edges of the directed acyclic graph).

There are two main ways to declare individual task dependencies. The recommended one is to use the >> and << operators:

first_task >> [second_task, third_task]

third_task << fourth_task

Or, you can also use the more explicit set_upstream and set_downstream methods:

first_task.set_downstream([second_task, third_task])

third_task.set_upstream(fourth_task)

There are also shortcuts to declaring more complex dependencies. If you want to make two lists of tasks depend on all parts of each other, you can’t use either of the approaches above, so you need to use cross_downstream:

from airflow.models.baseoperator import cross_downstream

# Replaces

# [op1, op2] >> op3

# [op1, op2] >> op4

cross_downstream([op1, op2], [op3, op4])

And if you want to chain together dependencies, you can use chain:

from airflow.models.baseoperator import chain

# Replaces op1 >> op2 >> op3 >> op4

chain(op1, op2, op3, op4)

# You can also do it dynamically

chain(*[EmptyOperator(task_id='op' + i) for i in range(1, 6)])

Chain can also do pairwise dependencies for lists the same size (this is different from the cross dependencies created by cross_downstream!):

from airflow.models.baseoperator import chain

# Replaces

# op1 >> op2 >> op4 >> op6

# op1 >> op3 >> op5 >> op6

chain(op1, [op2, op3], [op4, op5], op6)

Loading DAGs

Airflow loads DAGs from Python source files, which it looks for inside its configured DAG_FOLDER. It will take each file, execute it, and then load any DAG objects from that file.

This means you can define multiple DAGs per Python file, or even spread one very complex DAG across multiple Python files using imports.

Note, though, that when Airflow comes to load DAGs from a Python file, it will only pull any objects at the top level that are a DAG instance. For example, take this DAG file:

dag_1 = DAG('this_dag_will_be_discovered')

def my_function():
    dag_2 = DAG('but_this_dag_will_not')

my_function()

While both DAG constructors get called when the file is accessed, only dag_1 is at the top level (in the globals()), and so only it is added to Airflow. dag_2 is not loaded.

Note

When searching for DAGs inside the DAG_FOLDER, Airflow only considers Python files that contain the strings airflow and dag (case-insensitively) as an optimization.

To consider all Python files instead, disable the DAG_DISCOVERY_SAFE_MODE configuration flag.

You can also provide an .airflowignore file inside your DAG_FOLDER, or any of its subfolders, which describes patterns of files for the loader to ignore. It covers the directory it’s in plus all subfolders underneath it. See .airflowignore below for details of the file syntax.

If the .airflowignore does not meet your needs and you want a more flexible way to control whether a python file needs to be parsed by Airflow, you can plug in your own callable by setting might_contain_dag_callable in the config file.

Note, this callable will replace the default Airflow heuristic, i.e. checking if the strings airflow and dag (case-insensitively) in the file.

def might_contain_dag(file_path: str, zip_file: zipfile.ZipFile | None = None) -> bool:
    # Your logic to check if there are DAGs defined in the file_path
    # Return True if the file_path needs to be parsed, otherwise False
    ...

Running DAGs

DAGs will run in one of two ways:

When they are triggered either manually or via the API

On a defined schedule, which is defined as part of the DAG

DAGs do not require a schedule, but it’s very common to define one. You define it via the schedule argument, like this:

with DAG("my_daily_dag", schedule="@daily"):

...

There are various valid values for the schedule argument:

with DAG("my_daily_dag", schedule="0 0 * * *"):

...

with DAG("my_one_time_dag", schedule="@once"):

...

with DAG("my_continuous_dag", schedule="@continuous"):

...

Tip

For more information on the different types of scheduling, see Authoring and Scheduling.

Every time you run a DAG, you are creating a new instance of that DAG which Airflow calls a DAG Run. DAG Runs can run in parallel for the same DAG, and each has a defined data interval, which identifies the period of data the tasks should operate on.

As an example of why this is useful, consider writing a DAG that processes a daily set of experimental data. It's been rewritten, and you want to run it on the previous 3 months of data—no problem, since Airflow can backfill the DAG and run copies of it for every day in those previous 3 months, all at once. Those DAG Runs will all have been started on the same actual day, but each DAG run will have one data interval covering a single day in that 3 month period, and that data interval is all the tasks, operators and sensors inside the DAG look at when they run.

In much the same way a DAG instantiates into a DAG Run every time it's run, Tasks specified inside a DAG are also instantiated into Task Instances along with it.

A DAG run will have a start date when it starts, and end date when it ends. This period describes the time when the DAG actually 'ran.' Aside from the DAG run's start and end date, there is another date called logical date (formally known as execution date), which describes the intended time a DAG run is scheduled or triggered. The reason why this is called logical is because of the abstract nature of it having multiple meanings, depending on the context of the DAG run itself.

For example, if a DAG run is manually triggered by the user, its logical date would be the date and time of which the DAG run was triggered, and the value should be equal to DAG run's start date. However, when the DAG is being automatically scheduled, with certain schedule interval put in place, the logical date is going to indicate the time at which it marks the start of the data interval, where the DAG run's start date would then be the logical date + scheduled interval.

Tip

For more information on logical date, see Data Interval and What does execution_date mean?.

DAG Assignment

Note that every single Operator/Task must be assigned to a DAG in order to run. Airflow has several ways of calculating the DAG without you passing it explicitly:

If you declare your Operator inside a with DAG block

If you declare your Operator inside a @dag decorator

If you put your Operator upstream or downstream of an Operator that has a DAG

Otherwise, you must pass it into each Operator with dag=.

Default Arguments

Often, many Operators inside a DAG need the same set of default arguments (such as their retries). Rather than having to specify this individually for every Operator, you can instead pass default_args to the DAG when you create it, and it will auto-apply them to any operator tied to it:

import pendulum

with DAG(
    dag_id="my_dag",
    start_date=pendulum.datetime(2016, 1, 1),
    schedule="@daily",
    default_args={"retries": 2},
):
    op = BashOperator(task_id="hello_world", bash_command="Hello World!")
    print(op.retries)  # 2

The DAG decorator

New in version 2.0.

As well as the more traditional ways of declaring a single DAG using a context manager or the DAG() constructor, you can also decorate a function with @dag to turn it into a DAG generator function:

airflow/example_dags/example_dag_decorator.py

@dag(
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=["example"],
)
def example_dag_decorator(email: str = "example@example.com"):
    """
    DAG to send server IP to email.

    :param email: Email to send IP to. Defaults to example@example.com.
    """
    get_ip = GetRequestOperator(task_id="get_ip", url="http://httpbin.org/get")

    @task(multiple_outputs=True)
    def prepare_email(raw_json: dict[str, Any]) -> dict[str, str]:
        external_ip = raw_json["origin"]
        return {
            "subject": f"Server connected from {external_ip}",
            "body": f"Seems like today your server executing Airflow is connected from IP {external_ip}<br>",
        }

    email_info = prepare_email(get_ip.output)

    EmailOperator(
        task_id="send_email", to=email, subject=email_info["subject"], html_content=email_info["body"]
    )

example_dag = example_dag_decorator()

As well as being a new way of making DAGs cleanly, the decorator also sets up any parameters you have in your function as DAG parameters, letting you set those parameters when triggering the DAG. You can then access the parameters from Python code, or from {{ context.params }} inside a Jinja template.

Note

Airflow will only load DAGs that appear in the top level of a DAG file. This means you cannot just declare a function with @dag - you must also call it at least once in your DAG file and assign it to a top-level object, as you can see in the example above.

Control Flow

By default, a DAG will only run a Task when all the Tasks it depends on are successful. There are several ways of modifying this, however:

Branching - select which Task to move onto based on a condition

Trigger Rules - set the conditions under which a DAG will run a task

Setup and Teardown - define setup and teardown relationships

Latest Only - a special form of branching that only runs on DAGs running against the present

Depends On Past - tasks can depend on themselves from a previous run

Branching

You can make use of branching in order to tell the DAG not to run all dependent tasks, but instead to pick and choose one or more paths to go down. This is where the @task.branch decorator come in.

The @task.branch decorator is much like @task, except that it expects the decorated function to return an ID to a task (or a list of IDs). The specified task is followed, while all other paths are skipped. It can also return None to skip all downstream tasks.

The task_id returned by the Python function has to reference a task directly downstream from the @task.branch decorated task.

Note

When a Task is downstream of both the branching operator and downstream of one or more of the selected tasks, it will not be skipped:

The paths of the branching task are branch_a, join and branch_b. Since join is a downstream task of branch_a, it will still be run, even though it was not returned as part of the branch decision.

The @task.branch can also be used with XComs allowing branching context to dynamically decide what branch to follow based on upstream tasks. For example:

@task.branch(task_id="branch_task")

def branch_func(ti=None):

xcom_value = int(ti.xcom_pull(task_ids="start_task"))

if xcom_value >= 5:

return "continue_task"

elif xcom_value >= 3:

return "stop_task"

else:

return None

start_op = BashOperator(

task_id="start_task",

bash_command="echo 5",

do_xcom_push=True,

dag=dag,

)

branch_op = branch_func()

continue_op = EmptyOperator(task_id="continue_task", dag=dag)

stop_op = EmptyOperator(task_id="stop_task", dag=dag)

start_op >> branch_op >> [continue_op, stop_op]

If you wish to implement your own operators with branching functionality, you can inherit from BaseBranchOperator, which behaves similarly to @task.branch decorator but expects you to provide an implementation of the method choose_branch.

Note

The @task.branch decorator is recommended over directly instantiating BranchPythonOperator in a DAG. The latter should generally only be subclassed to implement a custom operator.

As with the callable for @task.branch, this method can return the ID of a downstream task, or a list of task IDs, which will be run, and all others will be skipped. It can also return None to skip all downstream task:

class MyBranchOperator(BaseBranchOperator):
    def choose_branch(self, context):
        """
        Run an extra branch on the first day of the month
        """
        if context['data_interval_start'].day == 1:
            return ['daily_task_id', 'monthly_task_id']
        elif context['data_interval_start'].day == 2:
            return 'daily_task_id'
        else:
            return None

Similar like @task.branch decorator for regular Python code there are also branch decorators which use a virtual environment called @task.branch_virtualenv or external python called @task.branch_external_python.

Latest Only

Airflow’s DAG Runs are often run for a date that is not the same as the current date - for example, running one copy of a DAG for every day in the last month to backfill some data.

There are situations, though, where you don’t want to let some (or all) parts of a DAG run for a previous date; in this case, you can use the LatestOnlyOperator.

This special Operator skips all tasks downstream of itself if you are not on the “latest” DAG run (if the wall-clock time right now is between its execution_time and the next scheduled execution_time, and it was not an externally-triggered run).

Here’s an example:

airflow/example_dags/example_latest_only_with_trigger.py

import datetime

import pendulum

from airflow.models.dag import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.latest_only import LatestOnlyOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="latest_only_with_trigger",
    schedule=datetime.timedelta(hours=4),
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=["example3"],
) as dag:
    latest_only = LatestOnlyOperator(task_id="latest_only")
    task1 = EmptyOperator(task_id="task1")
    task2 = EmptyOperator(task_id="task2")
    task3 = EmptyOperator(task_id="task3")
    task4 = EmptyOperator(task_id="task4", trigger_rule=TriggerRule.ALL_DONE)

    latest_only >> task1 >> [task3, task4]
    task2 >> [task3, task4]

In the case of this DAG:

task1 is directly downstream of latest_only and will be skipped for all runs except the latest.

task2 is entirely independent of latest_only and will run in all scheduled periods

task3 is downstream of task1 and task2 and because of the default trigger rule being all_success will receive a cascaded skip from task1.

task4 is downstream of task1 and task2, but it will not be skipped, since its trigger_rule is set to all_done.

Depends On Past

You can also say a task can only run if the previous run of the task in the previous DAG Run succeeded. To use this, you just need to set the depends_on_past argument on your Task to True.

Note that if you are running the DAG at the very start of its life—specifically, its first ever automated run—then the Task will still run, as there is no previous run to depend on.
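A minimal illustration (the task name is invented; depends_on_past is the real Task argument described above):

from airflow.operators.empty import EmptyOperator

# This task's instance in a given DAG Run will only start once the same
# task succeeded in the previous DAG Run.
wait_for_history = EmptyOperator(
    task_id="wait_for_history",
    depends_on_past=True,
)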

Trigger Rules

By default, Airflow will wait for all upstream (direct parents) tasks for a task to be successful before it runs that task.

However, this is just the default behaviour, and you can control it using the trigger_rule argument to a Task. The options for trigger_rule are:

all_success (default): All upstream tasks have succeeded

all_failed: All upstream tasks are in a failed or upstream_failed state

all_done: All upstream tasks are done with their execution

all_skipped: All upstream tasks are in a skipped state

one_failed: At least one upstream task has failed (does not wait for all upstream tasks to be done)

one_success: At least one upstream task has succeeded (does not wait for all upstream tasks to be done)

one_done: At least one upstream task succeeded or failed

none_failed: All upstream tasks have not failed or upstream_failed - that is, all upstream tasks have succeeded or been skipped

none_failed_min_one_success: All upstream tasks have not failed or upstream_failed, and at least one upstream task has succeeded.

none_skipped: No upstream task is in a skipped state - that is, all upstream tasks are in a success, failed, or upstream_failed state

always: No dependencies at all, run this task at any time

You can also combine this with the Depends On Past functionality if you wish.

Note

It’s important to be aware of the interaction between trigger rules and skipped tasks, especially tasks that are skipped as part of a branching operation. You almost never want to use all_success or all_failed downstream of a branching operation.

Skipped tasks will cascade through trigger rules all_success and all_failed, and cause them to skip as well. Consider the following DAG:

# dags/branch_without_trigger.py
import pendulum

from airflow.decorators import task
from airflow.models import DAG
from airflow.operators.empty import EmptyOperator

dag = DAG(
    dag_id="branch_without_trigger",
    schedule="@once",
    start_date=pendulum.datetime(2019, 2, 28, tz="UTC"),
)

run_this_first = EmptyOperator(task_id="run_this_first", dag=dag)

@task.branch(task_id="branching")
def do_branching():
    return "branch_a"

branching = do_branching()

branch_a = EmptyOperator(task_id="branch_a", dag=dag)
follow_branch_a = EmptyOperator(task_id="follow_branch_a", dag=dag)
branch_false = EmptyOperator(task_id="branch_false", dag=dag)
join = EmptyOperator(task_id="join", dag=dag)

run_this_first >> branching
branching >> branch_a >> follow_branch_a >> join
branching >> branch_false >> join

join is downstream of follow_branch_a and branch_false. The join task will show up as skipped because its trigger_rule is set to all_success by default, and the skip caused by the branching operation cascades down to skip a task marked as all_success.

By setting trigger_rule to none_failed_min_one_success in the join task, we can instead get the intended behaviour:
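For instance, the join task from the example above can be redefined as follows (a minimal sketch; trigger_rule also accepts the TriggerRule enum instead of the string form):

join = EmptyOperator(
    task_id="join",
    # Skips from the untaken branch no longer cascade into this task.
    trigger_rule="none_failed_min_one_success",
    dag=dag,
)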

Setup and teardown¶

In data workflows it’s common to create a resource (such as a compute resource), use it to do some work, and then tear it down. Airflow provides setup and teardown tasks to support this need.

Please see main article Setup and Teardown for details on how to use this feature.
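As a hedged sketch of the pattern (assuming Airflow 2.7+ setup/teardown support; task names are hypothetical):

import pendulum

from airflow.models.dag import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="setup_teardown_example",  # hypothetical id
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
):
    create_cluster = EmptyOperator(task_id="create_cluster")
    run_job = EmptyOperator(task_id="run_job")
    delete_cluster = EmptyOperator(task_id="delete_cluster")

    # Pairing delete_cluster with create_cluster as its teardown means the
    # teardown still runs if run_job fails, without failing the run itself.
    create_cluster >> run_job >> delete_cluster.as_teardown(setups=create_cluster)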

Dynamic DAGs¶

Since a DAG is defined by Python code, there is no need for it to be purely declarative; you are free to use loops, functions, and more to define your DAG.

For example, here is a DAG that uses a for loop to define some tasks:

with DAG("loop_example", ...):

first = EmptyOperator(task_id="first")

last = EmptyOperator(task_id="last")

options = ["branch_a", "branch_b", "branch_c", "branch_d"]

for option in options:

t = EmptyOperator(task_id=option)

first >> t >> last

In general, we advise you to try and keep the topology (the layout) of your DAG tasks relatively stable; dynamic DAGs are usually better used for dynamically loading configuration options or changing operator options.
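For instance, a hedged sketch (the config file path and schema are hypothetical) of generating tasks from configuration rather than hard-coding them:

import json
import pathlib

import pendulum

from airflow.models.dag import DAG
from airflow.operators.empty import EmptyOperator

# Hypothetical config file, e.g. {"tables": ["users", "orders", "events"]}
config = json.loads(pathlib.Path("/opt/airflow/config/tables.json").read_text())

with DAG(
    dag_id="config_driven_example",  # hypothetical id
    schedule="@daily",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
):
    start = EmptyOperator(task_id="start")
    for table in config["tables"]:
        # One load task per configured table; the topology stays stable
        # as long as the config file does.
        start >> EmptyOperator(task_id=f"load_{table}")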

DAG Visualization¶

If you want to see a visual representation of a DAG, you have two options:

You can load up the Airflow UI, navigate to your DAG, and select “Graph”

You can run airflow dags show, which renders it out as an image file
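For example (the DAG id here is hypothetical; --save infers the image format from the file extension):

airflow dags show my_dag --save my_dag.png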

We generally recommend you use the Graph view, as it will also show you the state of all the Task Instances within any DAG Run you select.

Of course, as you develop out your DAGs they are going to get increasingly complex, so we provide a few ways to modify these DAG views to make them easier to understand.

TaskGroups¶

A TaskGroup can be used to organize tasks into hierarchical groups in Graph view. It is useful for creating repeating patterns and cutting down visual clutter.

Unlike SubDAGs, TaskGroups are purely a UI grouping concept. Tasks in TaskGroups live on the same original DAG, and honor all the DAG settings and pool configurations.

Dependency relationships can be applied across all tasks in a TaskGroup with the >> and << operators. For example, the following code puts task1 and task2 in TaskGroup group1 and then puts both tasks upstream of task3:

from airflow.decorators import task_group


@task_group()
def group1():
    task1 = EmptyOperator(task_id="task1")
    task2 = EmptyOperator(task_id="task2")


task3 = EmptyOperator(task_id="task3")

group1() >> task3

TaskGroup also supports default_args like DAG; these override the default_args set at the DAG level:

import datetime

from airflow import DAG
from airflow.decorators import task_group
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="dag1",
    start_date=datetime.datetime(2016, 1, 1),
    schedule="@daily",
    default_args={"retries": 1},
):

    @task_group(default_args={"retries": 3})
    def group1():
        """This docstring will become the tooltip for the TaskGroup."""
        task1 = EmptyOperator(task_id="task1")
        task2 = BashOperator(task_id="task2", bash_command="echo Hello World!", retries=2)
        print(task1.retries)  # 3
        print(task2.retries)  # 2

If you want to see a more advanced use of TaskGroup, you can look at the example_task_group_decorator.py example DAG that comes with Airflow.

Note

By default, child tasks/TaskGroups have their IDs prefixed with the group_id of their parent TaskGroup. This helps to ensure uniqueness of group_id and task_id throughout the DAG.

To disable the prefixing, pass prefix_group_id=False when creating the TaskGroup, but note that you will now be responsible for ensuring every single task and group has a unique ID of its own.
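A small sketch of the difference (group and task ids hypothetical), using the TaskGroup context manager:

import pendulum

from airflow.models.dag import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup

with DAG(
    dag_id="prefix_example",  # hypothetical id
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
):
    with TaskGroup(group_id="group1"):
        task1 = EmptyOperator(task_id="task1")

    with TaskGroup(group_id="group2", prefix_group_id=False):
        task2 = EmptyOperator(task_id="task2")

    print(task1.task_id)  # "group1.task1" - prefixed by default
    print(task2.task_id)  # "task2" - unprefixed; uniqueness is now your job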

Note

When using the @task_group decorator, the decorated function’s docstring will be used as the TaskGroup’s tooltip in the UI, except when a tooltip value is explicitly supplied.

Edge Labels¶

As well as grouping tasks into groups, you can also label the dependency edges between different tasks in the Graph view - this can be especially useful for branching areas of your DAG, so you can label the conditions under which certain branches might run.

To add labels, you can use them directly inline with the >> and << operators:

from airflow.utils.edgemodifier import Label

my_task >> Label("When empty") >> other_task

Or, you can pass a Label object to set_upstream/set_downstream:

from airflow.utils.edgemodifier import Label

my_task.set_downstream(other_task, Label("When empty"))

Here’s an example DAG which illustrates labeling different branches:

airflow/example_dags/example_branch_labels.py[source]

with DAG(
    "example_branch_labels",
    schedule="@daily",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    ingest = EmptyOperator(task_id="ingest")
    analyse = EmptyOperator(task_id="analyze")
    check = EmptyOperator(task_id="check_integrity")
    describe = EmptyOperator(task_id="describe_integrity")
    error = EmptyOperator(task_id="email_error")
    save = EmptyOperator(task_id="save")
    report = EmptyOperator(task_id="report")

    ingest >> analyse >> check
    check >> Label("No errors") >> save >> report
    check >> Label("Errors found") >> describe >> error >> report

DAG & Task Documentation¶

It’s possible to add documentation or notes to your DAGs & task objects that are visible in the web interface (“Graph” & “Tree” for DAGs, “Task Instance Details” for tasks).

There are a set of special task attributes that get rendered as rich content if defined:

attribute    rendered to
---------    -----------
doc          monospace
doc_json     json
doc_yaml     yaml
doc_md       markdown
doc_rst      reStructuredText

Please note that for DAGs, doc_md is the only attribute interpreted. For DAGs it can contain a string or a reference to a template file; template references are recognized by a string value ending in .md.

If a relative path is supplied it will start from the folder of the DAG file. Also the template file must exist or Airflow will throw a jinja2.exceptions.TemplateNotFound exception.
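A hedged sketch (the .md path is hypothetical) of pointing a DAG’s doc_md at a template file instead of an inline string:

import pendulum

from airflow.models.dag import DAG

dag = DAG(
    "documented_dag",  # hypothetical id
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
)
# A str ending in ".md" is treated as a template file reference, resolved
# relative to the folder containing this DAG file; the file must exist.
dag.doc_md = "docs/dag_overview.md"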

This is especially useful if your tasks are built dynamically from configuration files, as it allows you to expose the configuration that led to the related tasks in Airflow:

"""

### My great DAG

"""

import pendulum

dag = DAG(

"my_dag",

start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),

schedule="@daily",

catchup=False,

)

dag.doc_md = __doc__

t = BashOperator("foo", dag=dag)

t.doc_md = """\

#Title"

Here's a [url](www.airbnb.com)

"""

SubDAGs¶

Note

SubDAG is deprecated, and TaskGroup is always the preferred choice.

Sometimes, you will find that you are regularly adding exactly the same set of tasks to every DAG, or you want to group a lot of tasks into a single, logical unit. This is what SubDAGs are for.

For example, here’s a DAG that has a lot of parallel tasks in two sections:

We can combine all of the parallel task-* operators into a single SubDAG, so that the resulting DAG resembles the following:

Note that SubDAG operators should contain a factory method that returns a DAG object. This will prevent the SubDAG from being treated like a separate DAG in the main UI - remember, if Airflow sees a DAG at the top level of a Python file, it will load it as its own DAG. For example:

airflow/example_dags/subdags/subdag.py[source]

import pendulum

from airflow.models.dag import DAG
from airflow.operators.empty import EmptyOperator


def subdag(parent_dag_name, child_dag_name, args) -> DAG:
    """
    Generate a DAG to be used as a subdag.

    :param str parent_dag_name: Id of the parent DAG
    :param str child_dag_name: Id of the child DAG
    :param dict args: Default arguments to provide to the subdag
    :return: DAG to use as a subdag
    """
    dag_subdag = DAG(
        dag_id=f"{parent_dag_name}.{child_dag_name}",
        default_args=args,
        start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
        catchup=False,
        schedule="@daily",
    )

    for i in range(5):
        EmptyOperator(
            task_id=f"{child_dag_name}-task-{i + 1}",
            default_args=args,
            dag=dag_subdag,
        )

    return dag_subdag

This SubDAG can then be referenced in your main DAG file:

airflow/example_dags/example_subdag_operator.py[source]

import datetime

from airflow.example_dags.subdags.subdag import subdag
from airflow.models.dag import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.subdag import SubDagOperator

DAG_NAME = "example_subdag_operator"

with DAG(
    dag_id=DAG_NAME,
    default_args={"retries": 2},
    start_date=datetime.datetime(2022, 1, 1),
    schedule="@once",
    tags=["example"],
) as dag:
    start = EmptyOperator(
        task_id="start",
    )

    section_1 = SubDagOperator(
        task_id="section-1",
        subdag=subdag(DAG_NAME, "section-1", dag.default_args),
    )

    some_other_task = EmptyOperator(
        task_id="some-other-task",
    )

    section_2 = SubDagOperator(
        task_id="section-2",
        subdag=subdag(DAG_NAME, "section-2", dag.default_args),
    )

    end = EmptyOperator(
        task_id="end",
    )

    start >> section_1 >> some_other_task >> section_2 >> end

You can zoom into a SubDagOperator from the graph view of the main DAG to show the tasks contained within the SubDAG:

Some other tips when using SubDAGs:

By convention, a SubDAG’s dag_id should be prefixed by the name of its parent DAG and a dot (parent.child)

You should share arguments between the main DAG and the SubDAG by passing arguments to the SubDAG operator (as demonstrated above)

SubDAGs must have a schedule and be enabled. If the SubDAG’s schedule is set to None or @once, the SubDAG will succeed without having done anything.

Clearing a SubDagOperator also clears the state of the tasks within it.

Marking success on a SubDagOperator does not affect the state of the tasks within it.

Refrain from using Depends On Past in tasks within the SubDAG as this can be confusing.

You can specify an executor for the SubDAG. It is common to use the SequentialExecutor if you want to run the SubDAG in-process and effectively limit its parallelism to one. Using LocalExecutor can be problematic as it may over-subscribe your worker, running multiple tasks in a single slot.

See airflow/example_dags for a demonstration.

Note

Parallelism is not honored by SubDagOperator, and so resources could be consumed by SubDagOperators beyond any limits you may have set.

TaskGroups vs SubDAGs¶

SubDAGs, while serving a similar purpose as TaskGroups, introduce both performance and functional issues due to their implementation.

The SubDagOperator starts a BackfillJob, which ignores existing parallelism configurations, potentially oversubscribing the worker environment.

SubDAGs have their own DAG attributes. When the SubDAG’s attributes are inconsistent with its parent DAG, unexpected behavior can occur.

You cannot see the “full” DAG in one view, as each SubDAG exists as a full-fledged DAG of its own.

SubDAGs introduce all sorts of edge cases and caveats, which can disrupt user experience and expectations.

TaskGroups, on the other hand, are a better option given that they are purely a UI grouping concept. All tasks within a TaskGroup still behave as any other tasks outside of it.

You can see the core differences between these two constructs.

TaskGroup                                           | SubDAG
----------------------------------------------------|----------------------------------------------------
Repeating patterns as part of the same DAG          | Repeating patterns as a separate DAG
One set of views and statistics for the DAG         | Separate set of views and statistics between
                                                    | parent and child DAGs
One set of DAG configuration                        | Several sets of DAG configurations
Honors parallelism configurations through the       | Does not honor parallelism configurations due to
existing SchedulerJob                               | the newly spawned BackfillJob
Simple construct declaration with context manager   | Complex DAG factory with naming restrictions

Packaging DAGs¶

While simpler DAGs are usually only in a single Python file, it is not uncommon that more complex DAGs might be spread across multiple files and have dependencies that should be shipped with them (“vendored”).

You can either do this all inside of the DAG_FOLDER, with a standard filesystem layout, or you can package the DAG and all of its Python files up as a single zip file. For instance, you could ship two DAGs along with a dependency they need as a zip file with the following contents:

my_dag1.py

my_dag2.py

package1/__init__.py

package1/functions.py
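As a hedged illustration (using the paths from the listing above, run from the folder that contains them), the archive can be built with Python’s zipfile module so that entries sit at the zip root:

import zipfile

files = [
    "my_dag1.py",
    "my_dag2.py",
    "package1/__init__.py",
    "package1/functions.py",
]

with zipfile.ZipFile("my_dags.zip", "w") as zf:
    for path in files:
        # Keep the relative paths so the DAG files stay at the archive root.
        zf.write(path)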

Note that packaged DAGs come with some caveats:

They cannot be used if you have pickling enabled for serialization

They cannot contain compiled libraries (e.g. libz.so), only pure Python

They will be inserted into Python’s sys.path and importable by any other code in the Airflow process, so ensure the package names don’t clash with other packages already installed on your system.

In general, if you have a complex set of compiled dependencies and modules, you are likely better off using the Python virtualenv system and installing the necessary packages on your target systems with pip.

.airflowignore¶

An .airflowignore file specifies the directories or files in DAG_FOLDER or PLUGINS_FOLDER that Airflow should intentionally ignore. Airflow supports two syntax flavors for patterns in the file, as specified by the DAG_IGNORE_FILE_SYNTAX configuration parameter (added in Airflow 2.3): regexp and glob.

Note

The default DAG_IGNORE_FILE_SYNTAX is regexp to ensure backwards compatibility.

For the regexp pattern syntax (the default), each line in .airflowignore specifies a regular expression pattern, and directories or files whose names (not DAG id) match any of the patterns will be ignored (under the hood, Pattern.search() is used to match the pattern). Use the # character to indicate a comment; all characters on a line following a # will be ignored.

As with most regexp matching in Airflow, the regexp engine is re2, which explicitly doesn’t support many advanced features; please check its documentation for more information.

With the glob syntax, the patterns work just like those in a .gitignore file:

The * character will match any number of characters, except /

The ? character will match any single character, except /

The range notation, e.g. [a-zA-Z], can be used to match one of the characters in a range

A pattern can be negated by prefixing with !. Patterns are evaluated in order, so a negation can override a previously defined pattern in the same file or patterns defined in a parent directory.

A double asterisk (**) can be used to match across directories. For example, **/__pycache__/ will ignore __pycache__ directories in each sub-directory to infinite depth.

If there is a / at the beginning or middle (or both) of the pattern, then the pattern is relative to the directory level of the particular .airflowignore file itself. Otherwise the pattern may also match at any level below the .airflowignore level.

The .airflowignore file should be put in your DAG_FOLDER. For example, you can prepare an .airflowignore file using the regexp syntax with content

project_a

tenant_[\d]

Or, equivalently, in the glob syntax

**/*project_a*

tenant_[0-9]*

Then files like project_a_dag_1.py, TESTING_project_a.py, tenant_1.py, project_a/dag_1.py, and tenant_1/dag_1.py in your DAG_FOLDER would be ignored. (If a directory’s name matches any of the patterns, this directory and all its subfolders will not be scanned by Airflow at all. This improves the efficiency of DAG finding.)

The scope of an .airflowignore file is the directory it is in plus all its subfolders. You can also prepare an .airflowignore file for a subfolder in DAG_FOLDER, and it would only be applicable for that subfolder.

DAG Dependencies¶

Added in Airflow 2.1

While dependencies between tasks in a DAG are explicitly defined through upstream and downstream relationships, dependencies between DAGs are a bit more complex. In general, there are two ways in which one DAG can depend on another:

triggering - TriggerDagRunOperator

waiting - ExternalTaskSensor
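A minimal sketch of both patterns (all DAG and task ids here are hypothetical):

import pendulum

from airflow.models.dag import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="dag_dependencies_example",  # hypothetical id
    schedule="@daily",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
):
    # Waits until the "final_task" task of "upstream_dag" succeeds
    # for the matching data interval.
    wait_for_upstream = ExternalTaskSensor(
        task_id="wait_for_upstream",
        external_dag_id="upstream_dag",
        external_task_id="final_task",
    )

    # Kicks off a run of "downstream_dag" from this DAG.
    trigger_downstream = TriggerDagRunOperator(
        task_id="trigger_downstream",
        trigger_dag_id="downstream_dag",
    )

    wait_for_upstream >> trigger_downstream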

An additional difficulty is that one DAG could wait for or trigger several runs of the other DAG with different data intervals. The Dag Dependencies view (Menu -> Browse -> DAG Dependencies) helps visualize dependencies between DAGs. The dependencies are calculated by the scheduler during DAG serialization, and the webserver uses them to build the dependency graph.

The dependency detector is configurable, so you can implement your own logic different than the defaults in DependencyDetector.

DAG pausing, deactivation and deletion¶

DAGs have several states when it comes to being “not running”: a DAG can be paused, deactivated, and finally all metadata for the DAG can be deleted.

A DAG can be paused via the UI when it is present in the DAGS_FOLDER and the scheduler has stored it in the database, but the user has chosen to disable it via the UI. The “pause” and “unpause” actions are available via the UI and API. A paused DAG is not scheduled by the scheduler, but you can still trigger it via the UI for manual runs. In the UI, paused DAGs appear in the Paused tab, while un-paused DAGs can be found in the Active tab. When a DAG is paused, any running tasks are allowed to complete and all downstream tasks are put into a state of “Scheduled”. When the DAG is unpaused, any “scheduled” tasks will begin running according to the DAG logic. DAGs with no “scheduled” tasks will begin running according to their schedule.

A DAG can be deactivated (not to be confused with the Active tag in the UI) by removing it from the DAGS_FOLDER. When the scheduler parses the DAGS_FOLDER and no longer finds a DAG that it had seen before and stored in the database, it sets it as deactivated. The metadata and history of a deactivated DAG are kept, and when the DAG is re-added to the DAGS_FOLDER it is activated again and its history becomes visible. You cannot activate/deactivate a DAG via the UI or API; this can only be done by removing files from the DAGS_FOLDER. Once again, no data for historical runs of the DAG is lost when it is deactivated by the scheduler. Note that the Active tab in the Airflow UI refers to DAGs that are both activated and not paused, so this might initially be a little confusing.

You can’t see deactivated DAGs in the UI; you can sometimes see their historical runs, but when you try to view the information about those runs you will see the error that the DAG is missing.

You can also delete the DAG metadata from the metadata database using the UI or API, but this does not always make the DAG disappear from the UI, which might also be a bit confusing at first. If the DAG is still in the DAGS_FOLDER when you delete the metadata, the DAG will re-appear as the scheduler parses the folder; only the historical run information for the DAG is removed.

This all means that if you want to actually delete a DAG and all of its historical metadata, you need to do it in three steps:

pause the DAG

delete the historical metadata from the database, via UI or API

delete the DAG file from the DAGS_FOLDER and wait until it becomes inactive


What is a DAG and why is it important? - dbt Labs

A DAG is a Directed Acyclic Graph, a type of graph whose nodes are directionally related to each other and don’t form a directional closed loop. In the practice of analytics engineering, DAGs are often used to visually represent the relationships between your data models.

While the concept of a DAG originated in mathematics and gained popularity in computational work, DAGs have found a home in the modern data world. They offer a great way to visualize data pipelines and lineage, and they offer an easy way to understand dependencies between data models.

DAG use cases and best practices

DAGs are an effective tool to help you understand relationships between your data models and areas of improvement for your overall data transformations.

Unpacking relationships and data lineage

Can you look at one of your data models today and quickly identify all the upstream and downstream models? If you can’t, that’s probably a good sign to start building or looking at your existing DAG.

Upstream or downstream? Upstream models are models that must be performed prior to the current model; in simple terms, the current model depends on upstream models in order to exist. Downstream relationships are the outputs from your current model. In a visual DAG, such as the dbt Lineage Graph, upstream models are to the left of your selected model and downstream models are to the right. If in doubt, use the arrows that create the directedness of a DAG to understand the direction of movement.

One of the great things about DAGs is that they are visual. You can clearly identify the nodes that connect to each other and follow the lines of direction. When looking at a DAG, you should be able to identify where your data sources are going and where that data is potentially being referenced.

Take this mini-DAG for an example:

[Figure: a miniature DAG]

What can you learn from this DAG? Immediately, you may notice a handful of things:

stg_users and stg_user_groups models are the parent models for int_users

A join is happening between stg_users and stg_user_groups to form the int_users model

stg_orgs and int_users are the parent models for dim_users

dim_users is at the end of the DAG and is therefore downstream from a total of four different models

Within 10 seconds of looking at this DAG, you can quickly unpack some of the most important elements about a project: dependencies and data lineage. Obviously, this is a simplified version of the DAGs you may see in real life, but the practice of identifying relationships and data flows remains very much the same, regardless of the size of the DAG.

What happens if stg_user_groups just up and disappears one day? How would you know which models are potentially impacted by this change? Look at your DAG and understand model dependencies to mitigate downstream impacts.

Auditing projects

A potentially bold statement, but there is no such thing as a perfect DAG. DAGs are special in part because they are unique to your business, data, and data models. There is usually room for improvement, whether that means making a CTE into its own view or performing a join earlier upstream, and your DAG can be an effective way to diagnose inefficient data models and relationships.

You can additionally use your DAG to help identify bottlenecks: long-running data models that severely impact the performance of your data pipeline. Bottlenecks can happen for multiple reasons:

Expensive joins

Extensive filtering or use of window functions

Complex logic stored in views

Good old large volumes of data

...to name just a few. Understanding the factors impacting model performance can help you decide on refactoring approaches, changing model materializations, replacing multiple joins with surrogate keys, or other methods.

[Figure: a bad DAG, one that follows non-modular data modeling techniques]

Modular data modeling best practices

See the DAG above? It follows a more traditional approach to data modeling, where new data models are often built from raw sources instead of relying on intermediary and reusable data models. This type of project does not scale with team or data growth, so analytics engineers tend to aim to have their DAGs not look like this.

Instead, there are some key elements that can help you create a more streamlined DAG and modular data models:

Leveraging staging, intermediate, and mart layers to create layers of distinction between sources and transformed data

Abstracting code that’s used across multiple models into its own model

Joining on surrogate keys versus on multiple values

These are only a few examples of best practices to help you organize your data models, business logic, and DAG. Instead of manually auditing your DAG for best practices, the dbt project evaluator package can help audit your project and find areas of improvement.

dbt and DAGs

The marketing team at dbt Labs would be upset with us if we told you we think dbt actually stood for “dag build tool,” but one of the key elements of dbt is its ability to generate documentation and infer relationships between models. One of the hallmark features of dbt Docs is the Lineage Graph (DAG) of your dbt project.

Whether you’re using dbt Core or Cloud, dbt Docs and the Lineage Graph are available to all dbt developers. The Lineage Graph in dbt Docs can show a model or source’s entire lineage, all within a visual frame. Clicking within a model, you can view the Lineage Graph and adjust selectors to only show certain models within the DAG. Analyzing the DAG here is a great way to diagnose potential inefficiencies or a lack of modularity in your dbt project.

[Figure: the Lineage Graph in dbt Docs]

The DAG is also available in the dbt Cloud IDE, so you and your team can refer to your lineage while you build your models.

One of the newer features of dbt is exposures, which allow you to define downstream uses of your data models outside of your dbt project, within your dbt project. This means you can add key dashboards, machine learning or data science pipelines, reverse ETL syncs, or other downstream use cases to your dbt project’s DAG. This level of interconnectivity can help boost data governance (who has access to and who owns this data) and transparency (which data sources and models affect your key reports).

Conclusion

A Directed Acyclic Graph (DAG) is a visual representation of your data models and their connections to each other. The key property of a DAG is that its nodes (sources/models/exposures) are directionally linked and don’t form closed loops. Overall, DAGs are an effective tool for understanding data lineage, dependencies, and areas of improvement in your data models.

Further reading

Data modeling techniques for more modularity

How we structure our dbt projects

How to audit your DAG

Refactoring legacy SQL to dbt
