GraphQL Mirror EAV refactor

wchargin · October 6, 2019, 4:23pm

Status: completed

Champion^?:

Initiative description

This is a proposed change to the internal structure of the databases managed by the GraphQL Mirror module.

GraphQL objects have three kinds of fields (for our purposes): primitives, node references, and connections (paginated lists of node references). In the SQL database, all node references are stored in a links table, and all connections are stored in connections and connection_entries tables. But primitive values are stored in a separate table for each type, with columns for each primitive type on the object. For instance, a Repository has primitive fields url, name, and createdAt, and so we generate a table like this one:

CREATE TABLE "primitives_Repository" (
    id TEXT NOT NULL PRIMARY KEY,
    "url",
    "name",
    "createdAt",
    FOREIGN KEY(id) REFERENCES objects(id)
);

The proposal is to replace all these type-specific tables with a single primitives table to store primitives for all objects, just like there is a single links table to store node references for all objects. The schema would be:

CREATE TABLE primitives (
    rowid INTEGER PRIMARY KEY,
    object_id TEXT NOT NULL,
    fieldname TEXT NOT NULL,
    value,  -- JSON string, or SQL 0 or 1 for nest fields
    UNIQUE(object_id, fieldname),
    FOREIGN KEY(object_id) REFERENCES objects(id)
)

…so each object would have one row in primitives for each of its primitive fields.

This is called an entity–attribute–value (EAV) table.

Benefits

The two primary benefits are maintainability and security.

Because we currently use dynamically generated table schemata, we also need dynamically generated SQL statements to read from and write to these tables. It’s not obvious at a glance that these generated queries even parse correctly. (SQL doesn’t support trailing commas, so did we properly handle the edge case where a type has no primitive fields? Nest field names have a literal period in them, so did we remember to quote them correctly? But quoted identifiers in SQL are bizarre, too; do we need to worry about the case-sensitivity issues that they introduce?) By switching to the simpler unified-table model, it should be easier for anyone, especially SQL novices, to read and contribute to code.

More importantly, dynamically constructing SQL queries is structurally insecure. This is Rule #1 of SQL: never concatenate SQL source; always use prepared statements. If you use only prepared statements, SQL injection is not possible. If you dynamically create queries, the onus is on you. Our table design has made it impossible to adhere to this rule, so our table design needs to change.

Implementation plan

Especially in the early stages of a project, there are many places where we may feel comfortable cutting corners. The core data loading pipeline is not one of them.

This change should follow standard migration best practices, with an “A-AB-B” launch sequence:

Create the new tables and read/write logic, retaining the old ones, such that SourceCred writes to both forms but still reads from the old ones.
Let this soak for a bit.
Switch reading over to read from the new tables via a flag flip.
Let this soak for a bit.
Remove the old tables and read/write logic.

The idea is that if some latent bug makes it into the new version, we can revert to the old version by a changing just a single line of code, after which users would immediately resume loading from the old, correct version. Contrast this with rather than attempting to quickly revert a sequence of commits that were merged a while ago and may no longer revert cleanly.

See the GitHub tracking issue for how these phases map onto actual code changes as pull requests.

Throughout this whole process, the Mirror module should retain 100% test coverage, as it always has done. (And this means “useful tests”, too.)

Estimated work (hours)

For @wchargin: 10 hours of focused work, given that I have lots of context on the original code and have been planning this refactor in my head for just under a year.

(Note: This is an accurate estimate, because the work has already been done, and I’m calculating retrospectively based on my Git reflog and shell command history.)

Dependencies

None (other than the creation of the Mirror module itself).

References

GitHub tracking issue:

https://github.com/sourcecred/sourcecred/issues/1313

Contributions

Implementation:

wchargin · October 20, 2019, 1:15am

This is now completed.

Beanow · October 21, 2019, 9:37pm

@wchargin added you as champion here, feel free to edit if that’s not right

wchargin · October 27, 2019, 6:02pm

@Beanow: Thanks. I think that that makes sense in principle: I took on all the “things to do” for this initiative and handled it from start to finish. It feels strange to assign championship retroactively, because much of the value provided by championship is in uncertainty reduction, and this is far less relevant once the initiative has been completed. For instance, staking grain that you’ll complete the initiative is trivial once you’ve already completed it. But given that the essence of this initiative predates both the championship system and the initiative system by nearly a year, so that the championship is retroactive only as a technicality, I’m okay with adding the designation anyway.

Beanow · October 30, 2019, 1:44pm

Good point. I’m making the assumption that the dynamic of handling the overall process in a way that created confidence did happen, in spite of the champion label not yet being a thing or as explicit as this wiki template. I feel like recognizing that in retrospect isn’t contradictory but rather a great sourcecred property.

But @decentralion is probably the person most relying on that and in the best position to check my assumption :]

decentralion · November 13, 2019, 2:22am

We still have some work to do on defining what Championship means in practice, and how it affects cred creation. I think marking William as the champion for this makes sense in principle, because he defined the problem, took responsibility for it, handled it from start to finish.

However, insofar as championship means taking on extra risk (e.g. staking grain behind performance) and extra responsibility (expectations of the community) for extra reward (increased size of payout), it’s not clear if it’s fair to apply it retro-actively since at the very least, there was no extra risk taken.

As a tentative conclusion, I’m fine with William being the Champion for this initiative (and applying the same reasoning to other initiatives). But as we further articulate championship, and especially if we start minting extra cred for Championship, we should re-evaluate this decision.

Topic		Replies	Views
GraphQL Mirror fidelity awareness Initiatives ✔️completed	15	2089	March 1, 2020
Graph Mappings vs Graph Filtering Research	1	821	May 6, 2019
Technical planning discussion Didathing	0	834	November 26, 2019
Populating the graph from the bottom up Research	5	861	April 29, 2019
Research Design: Exploratory Data Analysis Research	9	991	April 17, 2019