At work, I came across an interesting problem involving graphs for which I found no completely satisfactory solution.
A colleague tried to model applications and their dependencies as a directed graph. When editing dependencies for a given node, he wanted to show only nodes as new possible descendants which are not yet direct descendants of the current node. Additionally, when selecting new parents, nodes that are already direct parents should not be shown. Since he is still on MySQL 4.0, subselects could not be used.
Here is the data model and a bit of test data:
mysql> show create table nodes\G
*** 1. row ***
Table: nodes
Create Table: CREATE TABLE `nodes` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(30) NOT NULL default '',
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0,00 sec)
mysql> show create table edges\G
*** 1. row ***
Table: edges
Create Table: CREATE TABLE `edges` (
`start` int(11) NOT NULL default '0',
`end` int(11) NOT NULL default '0',
PRIMARY KEY (`start`,`end`),
KEY `end` (`end`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0,00 sec)
mysql> select * from nodes;
+----+-------+
| id | name |
+----+-------+
| 1 | one |
| 2 | two |
| 3 | three |
| 4 | four |
| 5 | five |
| 6 | six |
+----+-------+
6 rows in set (0,00 sec)
mysql> select * from edges;
+-------+-----+
| start | end |
+-------+-----+
| 1 | 2 |
| 1 | 3 |
| 3 | 4 |
| 3 | 5 |
| 5 | 6 |
+-------+-----+
5 rows in set (0,00 sec)
Excluding direct descendants
What we want is a list of all nodes, minus that nodes that are at the end of edges starting at our start node. Assume that our start node is 3, then we want all nodes except node 3 itself and nodes 4 and 5, because these are at the end of edges starting at 3.
A list of all nodes minus some nodes is a left join. So we get
mysql> select * from nodes left join edges on nodes.id = edges.end;
+----+-------+-------+------+
| id | name | start | end |
+----+-------+-------+------+
| 1 | one | NULL | NULL |
| 2 | two | 1 | 2 |
| 3 | three | 1 | 3 |
| 4 | four | 3 | 4 |
| 5 | five | 3 | 5 |
| 6 | six | 5 | 6 |
+----+-------+-------+------+
6 rows in set (0,01 sec)
We do not want nodes.id = 3, and we do not want these nodes that are at the end of edges starting at 3, so the final query is
mysql> select * from nodes left join edges on nodes.id = edges.end
-> where nodes.id != 3
-> and (edges.start != 3 or edges.start is null);
+----+------+-------+------+
| id | name | start | end |
+----+------+-------+------+
| 1 | one | NULL | NULL |
| 2 | two | 1 | 2 |
| 6 | six | 5 | 6 |
+----+------+-------+------+
3 rows in set (0,01 sec)
We have to include "edges.start is null", because we want the root node or any unconnected nodes - these are not at the end of any edges, and can thus be safely included.
Excluding direct parents
Excluding parents is not so easy. Look at the nodes that are at the end of edges starting at our nodes:
mysql> select * from nodes left join edges on nodes.id = edges.start;
+----+-------+-------+------+
| id | name | start | end |
+----+-------+-------+------+
| 1 | one | 1 | 2 |
| 1 | one | 1 | 3 |
| 2 | two | NULL | NULL |
| 3 | three | 3 | 4 |
| 3 | three | 3 | 5 |
| 4 | four | NULL | NULL |
| 5 | five | 5 | 6 |
| 6 | six | NULL | NULL |
+----+-------+-------+------+
8 rows in set (0,00 sec)
If we assume our current node to be 5, we'd have to exclude node 3, because 3 is at the start of an edge ending at 5. But node 3 is listed twice, since it is also at the end of an edge ending at 4. So the simple inversion of the descendants query is not going to work:
mysql> select * from nodes left join edges on nodes.id = edges.start
-> where ( edges.end != 5 or edges.end is null);
+----+-------+-------+------+
| id | name | start | end |
+----+-------+-------+------+
| 1 | one | 1 | 2 |
| 1 | one | 1 | 3 |
| 2 | two | NULL | NULL |
| 3 | three | 3 | 4 |
| 4 | four | NULL | NULL |
| 5 | five | 5 | 6 |
| 6 | six | NULL | NULL |
+----+-------+-------+------+
7 rows in set (0,00 sec)
What will work for simple forks (only two nodes share a parent) is something like poor man's subquery, but the general case for this will be just the subquery that MySQL 4.0 does not have:
mysql> select * from nodes left join edges on nodes.id = edges.start
-> left join edges b on nodes.id = b.start
-> and edges.end != b.end
-> where ( 5 not in (edges.end, b.end)
-> or edges.end is null or b.end is null )
-> and nodes.id != 5;
+----+------+-------+------+-------+------+
| id | name | start | end | start | end |
+----+------+-------+------+-------+------+
| 1 | one | 1 | 2 | 1 | 3 |
| 1 | one | 1 | 3 | 1 | 2 |
| 2 | two | NULL | NULL | NULL | NULL |
| 4 | four | NULL | NULL | NULL | NULL |
| 6 | six | NULL | NULL | NULL | NULL |
+----+------+-------+------+-------+------+
5 rows in set (0,00 sec)
Is there a solution that produces the desired result that does not involve a subquery?
(Thanks to Leithal and dammit on freenode:#mysql for their input and improvements)