Edge Tranches: Fixing an Incentive Misalignment

This post describes an incentive incompatibility in SourceCred today, along with a concrete plan to fix it.

Let’s suppose we’re interested in nodes that have two types of edges: AUTHORS edges and REFERENCES edges.

You’re the author of one of these nodes. Having written it, you know there are 9 relevant citations it could reference.

Let’s suppose that we’re in the current iteration of SourceCred, and AUTHORS and REFERENCES both have 1x weight. Your options:

  1. Hide all the references, and only add the AUTHORS edge. You get 100% of the post’s cred, but you might get called out for not including any references.
  2. Add all references. You’ve been very honest, and now you only get 10% of the cred.
  3. Add only the 3 best-known references. You’ve “cheated” 6 references out of their cred, but now you get 25% of the post’s cred.
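To make the arithmetic behind these three options concrete, here is a minimal sketch. It assumes the node has exactly one AUTHORS edge and that every edge carries the same 1x weight, so cred is split evenly across all outgoing edges. The function name is hypothetical and not part of SourceCred.

```python
from fractions import Fraction


def author_share_equal_weights(num_references: int) -> Fraction:
    """Author's share of a node's cred when the single AUTHORS edge and
    every REFERENCES edge all carry the same 1x weight (assumption)."""
    total_edges = 1 + num_references  # one AUTHORS edge plus the references
    return Fraction(1, total_edges)


# The three options from the list above: 0, 9, and 3 references.
for n in (0, 9, 3):
    print(f"{n} references -> author gets {author_share_equal_weights(n)}")
```

Every added reference shrinks the author's share, which is exactly the incentive problem.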

Clearly, this is a bad incentive, since it encourages people to under-reference. Conversely, there’s a sort of Gresham’s Law dynamic at play, where one can “spam away” cred from certain edges by adding other low-quality edges.

I think we can solve this by adding a new top-level division of cred between edge types, which I’m tentatively calling “edge tranches”. The basic idea is: a node’s cred is split up in advance between the different edge types. And if there are no edges of a given type, all of that cred is recycled to the seed vector. Let’s suppose we decide to split 50% cred to authors and 50% cred to references. Then your options are:

  1. Hide all the references. You get 50% of the cred, 50% gets recycled to the seed vector.
  2. Add the most popular 3 references. You get 50% of the cred, 50% is split between those 3 references.
  3. Add all 9 references. You get 50% of the cred, 50% is split across all the references.
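The same three options under edge tranches can be sketched as follows. This is only an illustration of the split described above, assuming a 50/50 division and an even split within the references tranche; the function and key names are hypothetical.

```python
from fractions import Fraction


def tranche_split(num_references: int,
                  author_tranche: Fraction = Fraction(1, 2)) -> dict:
    """Split a node's cred between an AUTHORS tranche and a REFERENCES tranche.

    Assumption: the references tranche is divided evenly among references,
    and an empty tranche is recycled to the seed vector.
    """
    reference_tranche = 1 - author_tranche
    if num_references == 0:
        # No REFERENCES edges: that tranche goes back to the seed vector.
        return {"author": author_tranche,
                "per_reference": Fraction(0),
                "seed": reference_tranche}
    return {"author": author_tranche,
            "per_reference": reference_tranche / num_references,
            "seed": Fraction(0)}


for n in (0, 3, 9):
    print(n, tranche_split(n))
```

The author's share is the same in all three cases, so the number of references no longer affects it.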

As you can see, there is no longer any incentive to under-report the references. Therefore, I think this will be a substantial improvement over the current system.

(Of course, there is an incentive to preferentially add references that will flow cred back to you, e.g. referencing works that you yourself authored. But that is a problem for another Discourse post. :wink:)

If folks agree with taking this approach, we should write an initiative to scope out this change.

cc @mzargham @wchargin

Seems like an improvement from our current situation. I’m into it. Thanks for putting this together :slight_smile:

I know I am late to the party in reviewing this but I am working through my SourceCred backlog as my contribution to CredCon today.

Concur strongly with option 3. Sub-partitioning cred allocations by types preserves all necessary properties of the core algorithm.

There remains higher-order gaming, in the sense that you might self-cite to draw some of the citation allocation back to yourself. However, I think for the meantime this becomes a normative checkpoint. If you are genuinely building on your own past work, this should be fine, but citing only your own work to capture the citation cred would generally be construed as an attack.

Also, here is an old “note” I wrote with David Sisson around a year ago, before Discourse was in use. It pertained to the MultiClass variation of PageRank and how it might be considered in SourceCred.

Happy CredCon! -Z

I’m not sure if this old-ish thread represents an idea that has already been implemented or further evolved elsewhere, but I just signed up to this forum to say that this is exactly the solution I wanted to see in SourceCred! I have some further suggestions.

40/60

I’d propose that we start with 60% of the cred flowing to the author and 40% to the references. Not a big change, but then we can always say the author gets “most” of the cred, and the minority goes to references, which feels better than “you only get half” (though 40% is a large minority).

Allocation Incentives

If the contribution is unreferenced, the reference cred goes back to the seed node, which I will call the “drain” (though I know that this actually just “genericizes” the cred, sending it to the whole project indiscriminately). We can explain to authors that by linking no references, they merely give up their opportunity to say where that cred goes. They are incentivized to add references. Perfect.

The only remaining problems are around where they put it:

  • They can bias their choices to be a higher ratio of references that point back to them, vs. not.
  • They can under-report the references out of mere laziness, despite the fact that reporting all the references is better.
  • They can under-report the references out of respect for the major ones, since minor ones take the same slice.
  • They can report references exhaustively, despite the fact that a major reference is wildly more important than any of the minor ones (say, a slight influence vs. a critical backbone).

Loopback factor (mitigate cred traps)

To mitigate the first problem, I suggest all potential target destinations be given a “loopback” score calculated for the given author. Think of it as microphone feedback - the mic can always hear the speaker at some negligible level, but only when the gap between them gets to a certain threshold does the chain reaction become audible and before you know it you have a horrible screeching noise - feedback.

We calculate a reference’s graph connections back to the author. If there are none, the loopback factor is zero. If there is one, we calculate its length in hops and weight it accordingly, where 0 hops (straight back) is maximum loopback, and half the total node count is minimum loopback (it would have to travel through a loop so big that it would involve half of all the nodes just to get back).

Alternatively, we can use the leakage at each node to determine how many hops it would take before the cred is diluted so much that it doesn’t matter - and then only look for loops smaller than that.

Finally, if there are multiple unique loopback paths of significant strength (which is perfectly normal), we combine them by weight, to get the reference’s final loopback factor.
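Here is a rough sketch of the hop-based idea, to show what I mean. It makes several assumptions that are not settled above: only the shortest path back to the author is considered (combining multiple paths is left out), "hops" means the number of intermediate nodes, and the factor falls off linearly from 1 at 0 hops to 0 at half the node count. The graph layout and all names are hypothetical.

```python
from collections import deque


def shortest_hops(graph, start, target):
    """Count intermediate nodes on the shortest directed path from start
    to target using breadth-first search; return None if no path exists."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist  # dist nodes lie between start and target
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def loopback_factor(graph, reference, author):
    """1.0 for a reference that points straight back to the author,
    falling linearly (assumption) to 0.0 at half the node count."""
    hops = shortest_hops(graph, reference, author)
    if hops is None:
        return 0.0  # no connection back to the author
    max_hops = len(graph) / 2  # every node must appear as a key in graph
    return max(0.0, 1.0 - hops / max_hops)


# Tiny example graph: adjacency lists, edges point in the direction cred flows.
graph = {
    "alice": [],
    "post_b": ["alice"],   # authored by alice: straight back
    "post_c": ["post_b"],  # reaches alice through post_b
    "post_d": [],          # unrelated to alice
}
for ref in ("post_b", "post_c", "post_d"):
    print(ref, loopback_factor(graph, ref, "alice"))
```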

A loopback factor of 0 means a reference gets 100% of the cred due, as usual. This decreases only slightly with a higher factor, at first. I’m imagining a cubic curve would do the trick. As the loopback factor gets close to the “suspicious zone”, the curve gets steep, which then flattens out near the top in the “obviously self-serving” zone. Near that point, only a small amount (say 8%) of the total reference cred gets considered for this reference.

The other 92%, however, is up for grabs to the other references, if they exist! The seed node takes whatever is left.
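One curve with roughly this shape is the cubic "smoothstep" polynomial. The sketch below is just one candidate, not a worked-out proposal: it assumes the loopback factor is normalized to the range 0 to 1, uses the 8% floor from above, and does not model how the freed cred is redistributed to other references or the seed node. The function name is hypothetical.

```python
def retained_fraction(loopback: float, floor: float = 0.08) -> float:
    """Fraction of its reference cred that a reference keeps.

    Uses the cubic smoothstep 3x^2 - 2x^3 (assumption): flat near 0,
    steepest in the middle, and flat again near 1, where only `floor`
    of the cred remains.
    """
    x = min(1.0, max(0.0, loopback))  # clamp to [0, 1]
    smooth = 3 * x**2 - 2 * x**3
    return 1.0 - (1.0 - floor) * smooth


for f in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"loopback {f:.1f} -> keeps {retained_fraction(f):.3f}")
```

A small loopback factor costs almost nothing, while the penalty ramps up quickly through the middle of the range.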

This way, it’s not pointless to reference yourself (which may be totally valid anyway) but not lucrative either. It’s also somewhat less lucrative to indirectly reference yourself, proportional to just how tightly knit the feedback loop is, but still totally worth it when you don’t do it too often.

The inverse of the loopback factor can be thought of as a diversity score - better yet, such a score could consider whether you’ve ever linked this far away on the graph before. Encouraging diverse references and countering local bias!

Not all references are created equal

To solve the remaining issues in the allocation bullet list above, what if we split references into different types? I understand this is already the case outside the OP example, but the critical piece is that there is a way to account for some “references” being far more important than others, so that people don’t leave out the little guys just to make sure the big ones get a big chunk.

Perhaps just three categories would do: 3 for the giants whose shoulders the author stood on (dependencies and such), 1 for the passing influences (minor wording tweaks, coloring ideas, emotional support), and 2 for everything in between.

These could literally get the 40% slice of the pie in a 3:2:1 ratio accordingly, and that might just work as-is. Success would mean that authors are always naming all their references, with so little to gain by culling them that they don’t bother trying.
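As a quick illustration of the 3:2:1 idea, here is a sketch that treats the category number as a per-reference weight and splits the 40% slice in proportion to those weights. That per-reference reading is my assumption (the alternative would be a fixed pool per category), and the reference names are made up.

```python
from fractions import Fraction


def weighted_reference_shares(importance: dict,
                              reference_tranche: Fraction = Fraction(2, 5)) -> dict:
    """Split the reference tranche in proportion to each reference's
    importance weight (3 = major, 2 = in between, 1 = minor)."""
    total = sum(importance.values())
    if total == 0:
        return {}  # no references: nothing to split
    return {name: reference_tranche * Fraction(weight, total)
            for name, weight in importance.items()}


shares = weighted_reference_shares({
    "core_dependency": 3,  # hypothetical major reference
    "helpful_thread": 2,   # hypothetical mid-level reference
    "typo_fix": 1,         # hypothetical minor reference
})
for name, share in shares.items():
    print(name, share)
```

Adding a minor reference takes only a small bite out of the major one's share, so there is little to gain by leaving it out.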
