A Roadmap to Entity Relationships

Sander Mertens
20 min readJul 14, 2023

Introduction

Since entity relationships first got introduced to Flecs in 2021, other ECS implementations have also started adding support for them, or are in the process of doing so. To make adding support for entity relationships easier, and to get a better idea of all the different moving parts, I put together this article with a list of things to do.

The time estimates provided are roughly how much time I spent on an initial implementation if I had worked on it full time. Note that these should be taken with a grain of salt as almost every part of the implementation has been refined multiple times over the past four years. The main purpose is to give a sense of how much effort is involved in a single task, including time spent on writing documentation and test cases.

This article only covers fragmenting relationships. The term “fragmenting” indicates that entities with different relationship pairs end up in different archetypes (just like regular components- for more info see this article), which fragments memory. Flecs also supports non-fragmenting relationships, but how to implement those is not covered by this article.

This article assumes the ECS has an archetype storage. While you can probably add support for relationships to a non-archetype ECS, this will look very different and would require another article.

This article also assumes that entities are represented as at most 64 bit integers, with a number of bits reserved for a generation count and no more than 32 bits reserved for the actual id. For more information, see this article.

The Roadmap

1. Components as entities

An ECS needs some way to identify components internally, which is usually done by assigning a unique integer to each component type. With components as entities, an entity is created for each component in the ECS, and the entity id is used as the unique identifier.

This achieves two things. First, it will prepare the storage for being able to handle runtime component ids which is a prerequisite for relationships. Second, it ensures that both elements of a relationship pair (e.g. (Eats, Apples)) are created from the same id space, and can fit into a single 64bit integer which as we’ll see in a bit is important.

With this supported, the ECS should be able to handle scenarios like:

entity platoon = world.entity();
entity unit = world.entity();
unit.add(platoon);

Estimate: 4 weeks

2. Observers

Now that components are entities, you’ll often find yourself in a situation where you’d like to be able to respond to a specific component being added to a component (called a “trait”- we’ll get to those in a bit).

To avoid having to implement separate APIs for traits, we need some mechanism that can respond to adding traits, so that we can update internal administration where necessary. This is where observers come in, which let us write code that gets executed when a specific component is added or removed.

Observers aren’t just useful for keeping ECS internals in sync, they’re also one of the most used features in Flecs applications.

With this supported, the ECS should be able to handle scenarios like:

world.observer<Position>()
.event(OnAdd)
.each([](Position& p) {
// ...
});

Estimate: 4 weeks

3. Support relationship pairs in archetype storage

This creates an encoding for a relationship pair that can be used in the archetype storage. An archetype has a sorted vector of ids that indicate which components the archetype stores.

To add support for relationships, we must be able to encode a relationship pair in a single 64 bit value. We can do this by taking the 64 bit entity ids of the pair, and strip off the generation count so that we’re left with two 32 bit ids. We can then combine these two 32 bit ids into a single 64 bit id, which we can store in the archetype’s component id vector.

Note that we can safely drop the generation count from the elements in a pair as archetypes should never have pairs with ids that are no longer alive, which will be guaranteed once we get to relationship cleanup.

Which id ends up where in the 64 bit value matters. We’ll want to make sure that all (Eats, *) ids are stored together as this will help later on with implementing efficient wildcard queries.

The archetype id vector is sorted, so we can achieve this grouping by putting the relationship (Eats) in the upper 32 bits of the 64bit value and the relationship target (Apples) in the lower 32 bits. For more details on what this encoding could look like, see this article.

With this supported, the ECS should be able to handle scenarios like:

entity Eats = world.entity();
entity Apples = world.entity();
entity e = world.entity();
e.add(Eats, Apples);

Estimate: 2 weeks

4. Relationship components

The storage now supports adding pairs to entities. In many ways these pairs behave like tags (components without data), but often it’s useful to be able to associate data with a pair. To support this, we need to be able to associated a type with a pair.

For regular components this is simple, as the component id is also the id that identifies the type. For relationship pairs, we need a set of rules that tells us which element of the pair is used as type. In Flecs, these rules looks like this, applied in this order:

  • If neither the first nor second elements are a type, the pair is a tag
  • If the first element has the “tag” trait, the pair is a tag
  • If the first element is a type, the pair type is the first element
  • If the second element is a type, the pair type is the second element

These rules allow the type to be derived from either the first or second element of a pair, which can be handy. The second rule mentions a “tag” trait. A trait, put simply, is a component that’s added to a component. Because our components are entities, this means we can add things to them, and these things are called traits.

The tag trait specifies that even if the second element of a pair is a type and the first element is not, the pair should not assume the type of the second element. This is useful for relationships like ChildOf: you wouldn’t for example want the(ChildOf, Position) pair to assume the type Position.

Both the storage and query implementation will have to be updated to derive the correct component type from a pair.

With this supported, the ECS should be able to handle scenarios like:

struct Eats { 
int amount;
};

entity Apples = world.entity();
entity e = world.entity();
e.set<Eats>(apples, {2});

Estimate: 4 weeks

5. Wildcard queries

At this point (cached) queries should be able to handle relationship pairs even if they haven’t explicitly been updated for it. This is because the relationship pair encoding is essentially just a really large component id, and queries should already be able to work with component ids.

The first real extension to queries is wildcards, which introduce the ability to query for things like (Eats, *) and (*, Apples), and even * and (*, *). For this two things need to happen. The first one is easy, which is to introduce a new entity id that can represent a wildcard. That way we can construct pair ids with wildcards.

The second thing is to update our query cache implementation to return results for each wildcard match. This essentially means that the same archetype can be added multiple times to the query cache, once for each pair that matches the wildcard.

For example, an archetype with (Eats, Apples), (Eats, Pears) should be added twice to the query cache for a query that matches (Eats, *). Finding all instances of (Eats, *) is easy as we can start at the first match, and iterate until we find an id that no longer matches.

To find all matches for(*, Apples) we can also start at the first match, but we’ll have to scan the remainder of the archetype since these pairs aren’t grouped together because of how we designed the pair encoding.

To find the first occurrence of a match in either case we can just scan the component id vector or, if the ECS already has it, use the component index data structure described below.

Flecs also supports the _ wildcard, which is the same as * except that it returns at most one match. This can be easily supported by adding another entity to represent the _ wildcard, and writing the corresponding matching logic for it in the query engine.

With this supported, the ECS should be able to handle scenarios like:

query q = world.query_builder()
.with<Eats>(Wildcard)
.build();

q.each([](iter& it, size_t index) {
entity food = it.pair(1).second();
entity e = it.entity(index);
cout << e.name() << " eats " << food.name() << "\n";
});

Estimate: 2 weeks

We now have a rudimentary implementation of relationships, but there are still big gaps in functionality. It works well as long as we have a finite list of relationship pairs, but we run into problems when this list is unbounded. The next items turn our basic implementation into something that’s more general purpose.

6. The component index

The component index is a data structure that lets us find all archetypes with a specific component. It’s been described in this article and extended in this article. For relationships, we add another field to the ArchetypeRecord which is a count, indicating how often an id occurs in an archetype.

With the component index, we can optimize our wildcard code in the query implementation. There are many other aspects of the ECS that could also benefit from the component index- but that’s for another post.

To make wildcards work, archetypes need to be registered for wildcard ids in the component index. What this means is that an archetype with pair (Eats, Apples), (Eats, Pears)needs to be registered for the following ids:

  • (Eats, Apples) column = 0, count = 1
  • (Eats, Pears) column = 1, count = 1
  • (Eats, *) column = 0, count = 2
  • (*, Apples) column = 0, count = 1
  • (*, Pears) column = 1, count = 1

Estimate: 4 weeks

7. Cleanup

A problem our current implementation has is that archetypes aren’t cleaned up when an entity used in a pair is deleted. For example, if we have an archetype with (ChildOf, my_parent), and we delete entity my_parent, the archetype won’t get cleaned up. This is problematic not just because it leaks memory, but also because entity ids are recycled, and my_parent could be reused for an entirely different purpose.

To solve this, we need some way to cleanup archetypes when entities used by archetypes are deleted. This doesn’t necessarily just apply to entities used in pairs, and can also apply to regular component entities.

Fortunately we don’t have to iterate all archetypes each time we delete an entity. With the component index in place, we can find all archetypes that need to be deleted in a few quick O(1) operations. In essence, when an entity e is deleted, we iterate and delete all archetypes with:

  • e
  • (*, e)
  • (e, *)

This is where things get a bit more complicated. To cleanup archetypes, all references to the archetype must also be deleted. This means that it must be deleted from the hashmap that finds archetypes by component id vector, and, if the ECS implements it, all incoming and outgoing edges from the archetype must also be cleaned up. The archetype must also be unregistered from the component index. When a deleted archetype contains an entity that’s used by another set of archetypes, those archetypes also have to be cleaned up.

Additionally, query caches must be notified do delete all instances of the deleted archetype (instances- because wildcard queries can cause an archetype to get inserted multiple times). If up to this point a query cache was a simple vector of archetypes, a new data structure will have to be introduced for more efficient removal. Otherwise you’d get an O(A*Q*N) operation, where A=the number of deleted archetypes, Q=the number of queries, and N=the number of archetypes per query cache.

An ECS implementation will likely also want to introduce a way to flag entities that require complex cleanup, so that not every deleted entity needs to consider whether archetypes need to get cleaned up. Flecs accomplishes this by reserving a few bits in the entity’s row which tell the code what kind of cleanup is necessary.

Something that complicates cleanup is that cleaning up archetypes can create new archetypes. For example: when entityApples is deleted, all entities in archetype Position, (Eats, Apples) need to be moved to archetype Position. It is not guaranteed that this archetype already exists, which means that archetype cleanup can cause archetype creation. Furthermore, this newly created archetype can in theory also contain an id that is about to be deleted.

This is one of the bigger tasks, but essential for relationships as it guarantees our storage doesn’t have dangling references, and makes sure our relationship pairs don’t suddenly point to garbage entities.

Estimate: 8 weeks

8. Cleanup traits

Now that we can cleanup archetypes, we’ll want to be able to customize what kind of cleanup action is required on a per-relationship level. In Flecs this is implemented with cleanup traits, which are traits added to a component or relationship entity that tell the storage what to do when an entity is deleted.

Flecs cleanup traits are relationship pairs, where the first element of the pair specifies the cleanup condition, and the second element what action is performed. The supported conditions are:

  • OnDelete
  • OnDeleteTarget

An OnDelete policy is executed when an entity used by itself or as relationship is deleted. An OnDeleteTarget policy is executed when an entity used as relationship target is deleted. The following actions are supported:

  • Remove
  • Delete
  • Panic

Remove is the default action, and will simply remove the tag/component/pair from all entities that have it. Delete will delete the entities with the tag/component pair, where Panic will throw a panic if any entities with the tag/component/pair exist.

A common trait to add is (OnDeleteTarget, Delete), which specifies recursive cleanup and is useful in combination with hierarchies (if a parent is deleted, also delete its children).

Executing the correct cleanup policy in all scenarios is a complex task that has to cover many edge cases. One noteworthy scenario to call out is what needs to happen during command execution.

Consider that (ChildOf, my_parent) was added to an entity, and my_parent was deleted in the same command buffer as where the ChildOfpair was added. The correct action for any order of these commands is that the entity to which the pair was added is also deleted. This requires the command execution logic to be aware of cleanup policies.

Estimate: 8 weeks

9. Multi-source queries

Most ECS implementations support queries on a single source, that is all queried for components are matched on a single entity. Relationship queries can span multiple entities, which means queries must be able to match components on multiple entities.

The simplest version of this is a query that matches a component on an entity known at query creation time. This could be something like a TimeOfDay component that’s matched on a game entity, which is made available as a query field during iteration.

Most of the complexity here is in communicating ownership to the query iterator, as it needs to know whether the data it’s making available is an array of values or a single value. It’s also useful for the iterator to provide information on what the source for a field is, which will come in handy later when the source isn’t always known upfront.

With this supported, the ECS should be able to handle scenarios like:

query<Position, TimeOfDay> q = world.query_builder<Position, TimeOfDay>()
.term_at(2).src(Game)
.build();

q.each([](Position& p, TimeOfDay& t) {
// ...
});

Estimate: 2 weeks

10. Relationship traversal

One of the most commonly used features is the ability of a query to traverse a relationship to find a certain component.

A typical example is a query that requests a Position from a parent entity. The Position component may be found on the direct parent of an entity, or further up the tree.

Another example is a tree of UI elements, where element styles can be specified on multiple levels in the tree, and cascade down to children. A query in this case would have to be able to traverse an unknown number of entities to find a specific style component.

A naive traversal function is relatively straightforward to implement, and because queries cache results, doesn’t have to be super performant to be useful. Storing the result of traversal and making the component available can largely reuse the code written for 8.

With this supported, the ECS should be able to handle scenarios like:

query q = world.query_builder<Position, Position>()
.term_at(2).up(ChildOf)
.build();

q.each([](Position& p_child, Position& p_parent) {
// ...
});

Estimate: 4 weeks

11. Query cache revalidation

Query traversal, while useful, introduces a tricky scenario: what happens if the entity found through traversal no longer has the component? What if another entity now has the component and is encountered first?

When these things happen, the state of a query cache is no longer valid, and it needs to be revalidated. There is a simple way of doing this, which is to just throw away the cache and rebuild it from scratch. The challenging part is to know when revalidation needs to happen.

This requires the creation of another mechanism in the ECS that can monitor component changes (what if parent p no longer has Position) and changes to the relationship graph (what if parent p got reparented). Because revalidation is an expensive operation (we’ll get to ways of optimizing it- but it remains costly), the monitoring mechanism should be as specific as possible, while also making sure not to miss any event that could have invalidated one of our caches.

Estimate: 12 weeks

12. Breadth first traversal

One of the most common requirements for systems is to iterate a hierarchy from top to bottom. The most typical example of this is a transform system, where parents need to be transformed before their children. While this can be hand coded by manually fetching the parent, children and their components, there is a much faster way to accomplish this.

The trick we can use is to sort the entries in the query cache based on their depth in the hierarchy. This gives us a zero-cost way to iterate a tree from top to bottom, as it just requires iterating the cache entries as usual.

Sorting the entire cache whenever a new archetype is created (or deleted!) can be expensive though, but fortunately there is a solution for that too: a grouping mechanism that groups archetypes in the cache together based on some numeric value (in this case the hierarchy depth).

We can use a hashmap to find the start and end of a group, which means we can add/remove archetypes to the cache in constant time without having to resort it.

Note that this adds another scenario under which cache invalidation can happen: even though a component is still matched on the same entity, the shape of the graph (and therefore the depth of a cache entry) could have changed.

With this supported, the ECS should be able to handle scenarios like:

query q = world.query_builder<Position, Position>()
.term_at(2).cascade(ChildOf) // cascade enables BFS
.build();

q.each([](Position& p_child, Position& p_parent) {
// ...
});

Estimate: 4 weeks

13. Uncached queries

One characteristic of fragmenting relationships is that archetype creation can be constant. This means that we need to be a bit more mindful of how much time we spend on matching archetypes with queries. We’ll also have a lot more archetypes, so any scenario that requires iterating all archetypes in the world (as is often done when initializing a query cache) quickly becomes a performance hog.

Additionally, applications will often want to find entities for a specific parent, platoon, faction or other entity that’s not known in advance. Creating a cached query for these quick ad-hoc requests is (too) expensive.

To address these scenarios, we implement a new uncached query mechanism. These are queries with the exact same query interface as regular queries, but, as the name implies, they don’t cache their results. Instead of iterating all archetypes in the world, uncached queries can use the component index and backtracking to quickly find matching archetypes, which speeds up query cache initialization and revalidation.

That still leaves matching a new archetype with an existing query. For this we can extend our uncached query implementation to allow us to test it against a single archetype or entity. In Flecs this is achieved by constraining the source of the query before we iterate it, which looks something like:

auto q = world.query<Position, Velocity>();

entity e = world.entity()
.set<Position>();

bool is_match = q.set_var("this", e).is_true();

When done right, all query matching logic ends up in the implementation of uncached queries, with all the logic for keeping the cache up to date ending up in the cache implementation. Implementing uncached queries and porting over existing code is a lot of work, but is one of the essential building blocks that makes relationships performant.

Estimate: 16 weeks

14. Multi component observers

In large applications, matching archetypes with queries can become expensive as applications can have hundreds, sometimes thousands of queries. To speed this process up, it would be ideal if we can narrow down the queries we need to notify.

We can do this by adding support for multi component observers. Multi component observers are essentially uncached queries combined with an event and callback.

What this allows us to do is create an observer for each query, and emit observer events when creating archetypes. This in practice limits the number of times an archetype is matched to the number of queries that have at least one component in common with the new archetype.

Multi component observers are also a useful feature for applications, as they provide the ability to monitor queries. With this supported, the ECS should be able to handle scenarios like:

world.observer<Position, Velocity>()
.event(OnAdd)
.each([](Position& p, Velocity& v) {
// ...
});

Estimate: 4 weeks

15. Event propagation

Now that observers are used to notify queries, we need to make sure that observers correctly handle relationship traversal.

Event propagation is a feature where events can be propagated along a relationship edge (either up or down). Event propagation allows for traversing an edge and notifying encountered observers.

While the basic concepts of event propagation are relatively straightforward, in actual applications propagation will happen frequently as result of structural changes we make to our entities, and so the mechanism needs to be fast or it’ll slow down many ECS operations.

To speed up propagation, Flecs implements a memoization data structure called the “reachable cache”. This data structure stores the components that are reachable from a certain relationship pair. For example, if entity parent has components Velocity, (ChildOf, grandparent), and entity grandparent has Position, then the reachable cache for (ChildOf, parent) has [Position, Velocity, (ChildOf, grandparent)].

The reachable cache helps with event propagation for observers that have traversal. For example, if I create a new archetype with (ChildOf, parent), the reachable cache can quickly tell me which components can be found if by traversing the ChildOf relationship upwards, without having to actually do the traversal for each emitted event.

Just like with query cache invalidation, the hardest part of the reachable cache implementation is knowing when to invalidate the cache. This took a long time to stabilize in Flecs, and required techniques such as fuzzing the APIs to find hard to reach (pun intended) scenarios.

Estimate: 16 weeks

15. Empty table optimization

Relationships increase archetype churn and component count per archetype, and as a result game worlds will have a larger number of empty archetypes. If left unchecked, this can become a significant factor in query performance, both for cached and uncached queries.

To counter this, Flecs implements a mechanism that tracks when a table becomes empty/non-empty, and propagates this to both the component index and query caches. Both the component index and query cache keep separate lists of empty/non-empty tables, which guarantees that while iterating a query (cached or uncached), it never spends time iterating empty tables.

Estimate: 4 weeks

16. Garbage collection

Applications that make heavy usage of relationships might see an explosion in relationship pair combinations, and thus archetypes. In some cases the total number of unique combinations is too large, either for performance reasons (it increases query matching time) or just because it occupies too much memory.

To address these kinds of scenarios, a garbage collection mechanism is required that can periodically get rid of internal data structures that are no longer used. While it is possible to just outright delete archetypes when they become empty, that can end up increasing costly archetype creation if some archetypes are only empty for a few frames.

In Flecs a garbage collection function is implemented that deletes archetypes after they have been empty for N garbage collection cycles. This function is invoked manually by the application and can be time-limited, so that garbage collection cycles never cause frame stalls.

In practice, when this function is called often enough, the number of archetypes to cleanup will be small, and the function doesn’t take longer than a few microseconds.

Estimate: 1 week

17. Rule engine

The rule engine is a query engine with features that allow for matching complex patterns against entity graphs. This makes it possible to use the ECS as a kind of relational database (more information in this article), where complex questions can be asked to drive game logic, such as:

  • Is there anyone in this tavern that belongs to a rival faction?
  • Which vendors in this city sell healing potions?
  • On which tiles can I build a windmill?
  • What equipment can I mount on my spaceship?

The rule engine extends uncached queries with a number of features that makes it possible to answer questions like these. While there are a few dozen things the rule engine does, the three largest extensions are:

  • Query variables
  • Component Inheritance
  • Transitivity

Each of these features are explained in more detail in the article linked above. While these features can be implemented as extensions to our backtracking uncached query implementation, a number of optimizations will have to be implemented to ensure these queries evaluate efficiently.

Consider component inheritance, where each subtype of a component needs to be evaluated in addition to the component itself. A backtracking solution can evaluate the same term of a query many times as it moves back and forth between possible solutions. If for each time a term is visited the component inheritance graph would have to be walked again, it would make query evaluation slow. To address this, memoization data structures are used to limit how often the graph is walked.

Additionally, the combination of different query operators, variables, wildcards, inheritance and transitivity is a lot of complexity to handle by a query engine. In a naive implementation, this could require excessive branching to figure out which features are needed by a term. To address this, the rule engine uses a virtual machine-like design. This greatly limits the number of branches, as the engine only runs code for the features required by the query, which is similar to an SQL query plan.

While the rule engine is an extension of uncached queries, some care must be taken when these features are used in combination with cached queries. In some cases rule queries return individual entities instead of archetypes, and the query cache is not designed to deal with individual entities, nor would it be very performant to keep a cache like this up to date. In practice, an implementation will have to only cache the terms that can be cached, and evaluate the remaining terms on the spot.

A future article will dive deeper in the nuances and data structures used by the rule engine.

Estimate: 8 weeks

18. Exclusive relationships

Exclusive relationships are relationships where an entity can only have a single instance. A typical example is ChildOf, as an entity can only have a single parent. Adding an exclusive relationship pair to an entity that already has one replaces the previous one.

Exclusive relationships can be efficiently implemented by encoding the added/removed pairs on the archetype graph. For example, an edge that adds (ChildOf, parent_2) to an entity that already had (ChildOf, parent_1) can point to an archetype where both (ChildOf, parent_2) is added and (ChildOf, parent_1) is removed.

With this supported, the ECS should be able to handle scenarios like:

child.add(ChildOf, parent_1);
child.add(ChildOf, parent_2); // replaces (ChildOf, parent_1)

Estimate: 1 week

18. Inheritance

Inheritance is a mechanism that makes it possible to share components between entities through an IsA relationship. This is a commonly used feature in Flecs applications, and is at the core of the Flecs prefabs implementation.

Inheritance touches many aspects of the API, such as the get operation automatically fetching inherited components and queries automatically matching inherited components. For a comprehensive list of features enabled by inheritance, see the Flecs prefab examples.

With this supported, the ECS should be able to handle scenarios like:

entity prefab = world.entity()
.set<Mass>({100});

entity instance = world.entity()
.add(IsA, prefab);

instance.get<Mass>(); // gets prefab.Mass

Estimate: 12 weeks

19. Query DSL

Sometimes you’ll want to create queries after the code for an application is compiled. A typical example of this is tools like the Flecs Explorer, where queries can be entered as a string. To enable this, the ECS will have to support a DSL (domain specific language) that can parse a string into an ECS query. The following example shows the same query created in the Flecs C++ API, and the query DSL:

query<Position> q = world.query_builder<Position>()
.without<Velocity>()
.with<Position>().parent()
.with<Likes>(Wildcard)
.build();
Position, !Velocity, Position(parent), (Likes, *)

A feature-complete DSL can be a powerful tool when debugging applications, and can even provide database-like access to a game, which can be useful when, for example, inspecting a multiplayer game server. While it’s not mandatory to implement for relationships, it is a useful tool to have, and relationships have a big impact on the design of the DSL.

Estimate: 2 weeks

Summary

Hopefully this provided some insight in what it takes to build a relationship implementation from scratch! When adding up all the estimates it comes out to about 2 years, which roughly corresponds with the time I spent on it in total- so far. I’ll try to keep this article up to date with any new developments!

If you’ve enjoyed this blog post, consider giving Flecs a ⭐️

Also check out the Discord community which has lots of amazing people working on bleeding-edge projects:

--

--

Sander Mertens

Author of Flecs, an Entity Component System for C and C++