GraphQL Mirror fidelity awareness

Status

Completed

Champion

@wchargin

Initiative description

This is a proposed change to the GraphQL Mirror API and implementation to make it more robust to incorrect and inconsistent query responses sent to us by the GitHub API.

Each field in a GraphQL schema has a declared return type. For example, the user field on the top-level query type of the GitHub API returns User, and the user field on the Reaction object type also returns User.

However, sometimes GitHub returns IDs for objects whose actual types do not match the types declared by the specification.

Example

For instance, the following query reports that the given reaction was created by a User with a specific ID…

{
  node(id: "MDg6UmVhY3Rpb24yMTY3ODkyNQ==") {
    __typename
    ... on Reaction {
      user {
        __typename
        id
        login
      }
    }
  }
}
{
  "data": {
    "node": {
      "__typename": "Reaction",
      "user": {
        "__typename": "User",
        "id": "MDEyOk9yZ2FuaXphdGlvbjE3OTUyOTI1",
        "login": "webineh"
      }
    }
  }
}

…but if we then ask for more properties about that User we are surprised to see that it is actually an Organization!

{
  node(id: "MDEyOk9yZ2FuaXphdGlvbjE3OTUyOTI1") {
    __typename
    id
  }
}
{
  "data": {
    "node": {
      "__typename": "Organization",
      "id": "MDEyOk9yZ2FuaXphdGlvbjE3OTUyOTI1"
    }
  }
}

This throws our loading pipeline for a loop, because we’ve received inconsistent data from the server. Currently, we work around this by manually blacklisting IDs of objects for which we’ve observed this problem (e.g., the reaction ID MDg6UmVhY3Rpb24yMTY3ODkyNQ== in the example above). Not only is this tedious, but it also means that we don’t properly count cred from those contributions and contributors: we ignore them entirely.

The purpose of this initiative is to make the GraphQL mirror structurally robust to these kinds of errors, such that we do not need to manually blacklist nodes. We should be able to tell the mirror that certain fields (or all fields) are “untrustworthy”, and that it should expect that results may be inconsistent and should take appropriate precautions.

Benefits

After this initiative is completed:

  • SourceCred will be able to confidently load repositories without manual intervention or blacklist updating.
  • SourceCred will be able to properly compute cred for users who have converted their accounts to organizations, and for bots like Greenkeeper that make automated pull requests.

Note that the first point is especially important for new users to SourceCred. If your first experience after hearing about this exciting project is, “cool, let me try it on my repository—wait, it’s just broken, and I have to send a pull request upstream with some weird IDs to fix it”, then there’s a fairly high chance that you just leave.

Implementation plan

GitHub issue #998 contains discussions of high-level system designs and a recommended option in the issue body, lower-level API designs in a comment, and a step-by-step implementation plan in a subsequent comment. These are fairly detailed; please refer to them for more information.

Estimated work (hours)

Around 20–40 focused hours for implementation. I’ve already spent at least 6–12 hours considering the problem space and solution space (including some helpful discussions with @decentralion), which led to the above design and implementation plans.

As with the GraphQL Mirror EAV refactor, it is critical that this work be done with careful attention to correctness, because it loads key data into a persistent store.

Dependencies

None, other than the creation of the Mirror module itself. The other current GraphQL Mirror initiative (the EAV refactor) is basically entirely unrelated, because it only covers primitive fields and fidelity only matters for non-primitive fields.

References

GitHub issues about this initiative:

Issues and pull requests that would not need to exist in a world where this were implemented (many from external contributors—not the best first experience):

Contributions

From this initiative:

Previously, contributor @wpank began a sequence of pull requests to address this issue, but these were never merged and have since gone stale.

4 Likes

Thank you for the detailed proposal, @wchargin. I’ve updated Make the GitHub Plugin Robust to point to this as a dependency.

Once we have Cred Bounties enabled, I propose we fund this with a large bounty.

1 Like

I’m going to start working on this. Can’t promise I’m going to nail it. Are there any steps to signify to the community that I’m working on this Initiative? Don’t want somebody else to come along and also start working on it without coordinating our efforts.

Thanks for the really detailed proposal @wchargin. It would have been very difficult to wrap my head around the Initiative otherwise. And also thanks for including @wpank 's contributions - I wouldn’t have thought to unearth those, but they’re very helpful as well.

Hi—I actually have started some work on this locally (maybe should have mentioned that in the “contributions” section, but they’re just private to me), so was planning on just finishing it up myself during CredCon or the week before. Hmm. I wonder what the best way to handle this is.

(apologies for quick note—have to run.)

One note: if I end up being primary on this and you’re still interested in being involved, I’d be happy to have you as a reviewer.

No problem at all, I haven’t written any code, only been trying to grok the design docs and the Mirror module - do you want to move the Initiative to the “In-Progress” lane?

Cool; thanks for understanding! I’ll take your comment plus @decentralion’s comment above as sufficient to move this past the proposal stage.

Since I’m now championing this, here’s a progress update from my work this weekend.

I have drafts of a sequence of changes that appears to be working end to end. With these changes, I can load both ipfs/js-ipfs-unixfs-engine and twbs/bootstrap with only default options and no blacklist file, which was previously not possible. I’ll go ahead and run full regression tests against the repositories and organizations currently represented in our blacklist file, but Bootstrap is a large repository (20K commits, 30K issues/PRs), and it didn’t surface any issues, so I’m hopeful that the rest will go smoothly, too.

I’ve written automated tests for some of these changes, but not all of them. Writing those tests is the primary remaining task before I’ll send out these changes for review.

Once the core API changes are in, the plugin-specific client changes are straightforward, requiring just one line changed per unfaithful field. Here’s the entirety of the diff to the GitHub plugin:

diff --git a/src/plugins/github/fetchGithubRepo.js b/src/plugins/github/fetchGithubRepo.js
index 6fed413e..7b28007d 100644
--- a/src/plugins/github/fetchGithubRepo.js
+++ b/src/plugins/github/fetchGithubRepo.js
@@ -64,3 +64,2 @@ export default async function fetchGithubRepo(
   const mirror = new Mirror(db, schema(), {
-    blacklistedIds: BLACKLISTED_IDS,
     guessTypename: _guessTypename,
diff --git a/src/plugins/github/schema.js b/src/plugins/github/schema.js
index 8e5e7c4c..7f098bae 100644
--- a/src/plugins/github/schema.js
+++ b/src/plugins/github/schema.js
@@ -94,3 +94,3 @@ export default function schema(): Schema.Schema {
       content: s.primitive(s.nonNull("ReactionContent")),
-      user: s.node("User"),
+      user: s.node("User", s.unfaithful(["User", "Organization"])),
       createdAt: s.primitive(s.nonNull("DateTime")),
@@ -117,3 +117,3 @@ export default function schema(): Schema.Schema {
         date: s.primitive(s.nullable("GitTimestamp")),
-        user: s.node("User"),
+        user: s.node("User", s.unfaithful(["User", "Bot"])),
       }),

I’ve opened pull requests for a few pre-work changes that can be cleanly separated out in order to make the main changes easier to review. I hope to get the final stack down to about 7 PRs.

3 Likes

Okay, I’ve finished the implementation and tests, and have run end-to-end tests on the whole blacklist. Everything seems to check out, so I’ve opened the following pull requests:

@brianlitwin: You’d previously expressed some interest in reviewing code for this initiative. How much do you feel like taking on, and is there anything that I can do to help you spin up on this? (May also ask @decentralion or @Beanow for reviews on some specific commits.)

1 Like

I would suggest tagging some others on the reviews; I’m not going to be able to provide much substantive feedback based on the complexity of this change and the mirror module. But I’m very happy to provide a second or third set of eyes; I had looked into this issue quite a bit, and looking at your implementation is a great learning experience!

Cool; thanks for letting us know! I just discussed this a bit with @decentralion and @Beanow. We anticipate that some new projects will want to start trying out SourceCred in the next few days, so merging this soon will be beneficial. We plan to merge on just cursory review for now, and revisit later with more detailed reviews for knowledge transfer.

The impetus behind the “merge soon on cursory review” requirement has passed, and we ended up deciding not to do that, so this initiative is currently blocked on review (#1662, #1665#1669). Friendly ping. :slight_smile:

I’ve approved all the outstanding PRs (based on “locally scoped” review – I read the code, but didn’t make an effort to build a full mental model of the change, so it’s more sanity/style check and not a design review).

Thanks! That’s reasonable. :slight_smile: We’re still missing review on #1662, which is at the bottom of the stack and so needs to go in first. This change should be pretty uncontroversial if you agree with the new types and docs added to the src/graphql/schema.js module.

Oops–sorry to overlook it. It’s approved now.

$ git log --oneline origin/master -n 8
eebacff1 (origin/master, origin/HEAD) github: replace blacklist with fidelity annotations (#1669)
2b58c687 mirror: permit and handle unfaithful fields (#1668)
5bf3d945 mirror: process typenames in query responses (#1667)
60bc7561 mirror: request typenames in GraphQL queries (#1666)
96157a2d mirror: permit storing objects without known typename (#1665)
5c937d86 mirror: add typename support to `UpdateResult`s (#1664)
477243fc mirror: add typename support to query plan (#1663)
b36ecb4d schema: add fidelity types (#1662)

It is done.

I’ve marked this initiative as completed, as the pull requests have been merged to master and the GitHub blacklist has been deleted. If you run into any holes or problems, please file a new issue on GitHub and CC me.

3 Likes