SourceCred

GraphQL Mirror EAV refactor

Initiative description

This is a proposed change to the internal structure of the databases managed by the GraphQL Mirror module.

GraphQL objects have three kinds of fields (for our purposes): primitives, node references, and connections (paginated lists of node references). In the SQL database, all node references are stored in a links table, and all connections are stored in connections and connection_entries tables. But primitive values are stored in a separate table for each type, with columns for each primitive type on the object. For instance, a Repository has primitive fields url, name, and createdAt, and so we generate a table like this one:

CREATE TABLE "primitives_Repository" (
    id TEXT NOT NULL PRIMARY KEY,
    "url",
    "name",
    "createdAt",
    FOREIGN KEY(id) REFERENCES objects(id)
);

The proposal is to replace all these type-specific tables with a single primitives table to store primitives for all objects, just like there is a single links table to store node references for all objects. The schema would be:

CREATE TABLE primitives (
    rowid INTEGER PRIMARY KEY,
    object_id TEXT NOT NULL,
    fieldname TEXT NOT NULL,
    value,  -- JSON string, or SQL 0 or 1 for nest fields
    UNIQUE(object_id, fieldname),
    FOREIGN KEY(object_id) REFERENCES objects(id)
)

…so each object would have one row in primitives for each of its primitive fields.

This is called an entity–attribute–value (EAV) table.

Benefits

The two primary benefits are maintainability and security.

Because we currently use dynamically generated table schemata, we also need dynamically generated SQL statements to read from and write to these tables. It’s not obvious at a glance that these generated queries even parse correctly. (SQL doesn’t support trailing commas, so did we properly handle the edge case where a type has no primitive fields? Nest field names have a literal period in them, so did we remember to quote them correctly? But quoted identifiers in SQL are bizarre, too; do we need to worry about the case-sensitivity issues that they introduce?) By switching to the simpler unified-table model, it should be easier for anyone, especially SQL novices, to read and contribute to code.

More importantly, dynamically constructing SQL queries is structurally insecure. This is Rule #1 of SQL: never concatenate SQL source; always use prepared statements. If you use only prepared statements, SQL injection is not possible. If you dynamically create queries, the onus is on you. Our table design has made it impossible to adhere to this rule, so our table design needs to change.

Implementation plan

Especially in the early stages of a project, there are many places where we may feel comfortable cutting corners. The core data loading pipeline is not one of them.

This change should follow standard migration best practices, with an “A-AB-B” launch sequence:

  1. Create the new tables and read/write logic, retaining the old ones, such that SourceCred writes to both forms but still reads from the old ones.
  2. Let this soak for a bit.
  3. Switch reading over to read from the new tables via a flag flip.
  4. Let this soak for a bit.
  5. Remove the old tables and read/write logic.

The idea is that if some latent bug makes it into the new version, we can revert to the old version by a changing just a single line of code, after which users would immediately resume loading from the old, correct version. Contrast this with rather than attempting to quickly revert a sequence of commits that were merged a while ago and may no longer revert cleanly.

See the GitHub tracking issue for how these phases map onto actual code changes as pull requests.

Throughout this whole process, the Mirror module should retain 100% test coverage, as it always has done. (And this means “useful tests”, too.)

Estimated work (hours)

For @wchargin: 10 hours of focused work, given that I have lots of context on the original code and have been planning this refactor in my head for just under a year.

(Note: This is an accurate estimate, because the work has already been done, and I’m calculating retrospectively based on my Git reflog and shell command history.)

Dependencies

None (other than the creation of the Mirror module itself).

References

GitHub tracking issue:

Links to contributions

Implementation:

1 Like

This is now completed.