CredRank: scalable, interpretable, flexible attribution

wchargin · March 2, 2020, 7:36am

Description

CredRank is a new variant of the SourceCred algorithm that takes a different approach to how we handle time. In early tests, CredRank:

improves cred analysis runtime by about 20×;
decreases output size by about 50×;
improves the interpretability of the algorithm, answering questions like “how much cred did this person earn this week?” with sound additive semantics; and
improves the flexibility of the algorithm, enabling features like bounded-time employment cred, node-dependent α-values, and granular time slicing.

Furthermore, our implementation sketch for CredRank introduces a data type for “Markov process graphs” with unidirectional edges weighted by pre-normalized transition probabilities. This simplifies the graph transformation pipeline by serving as a stepping stone from the plugin graph interface to raw Markov chains.

Conceptually, the main change is to transform edges incident to scoring nodes (users) to instead pass through an family of “epoch nodes” indexed by time period. We call this process fibration (by analogy with fiber bundles). The score of an epoch node for user u in time period t represents how much cred a user earned for contributions made in that time period. Epoch nodes send a portion of their cred to their owning user, and the rest along the out-edges that were previously incident to the owner. Adjacent epoch nodes have “webbing” edges forward and backward in time, so that a contribution’s cred is slightly smeared to mitigate spoiler effects and reduce variance. We also reify the teleportation vector as a node in the graph (the “seed node”) with in-edges from most nodes (but not epoch nodes) and out-edges in proportion to node weights. Thus, a Markov process graph can be converted to a standard Markov chain whose stationary distribution can be computed without any α-parameterization at the Markov process level.

While this is more an evolution of the old algorithm than a reinvention from scratch, we’ve adopted the name CredRank to clarify that the algorithm is meaningfully different from its “classic PageRank” cousin.

Status

In progress

Champions

@wchargin and @decentralion

Benefits

CredRank enables us to run cred analysis and grain distribution with a much tighter feedback cycle by dramatically improving the efficiency of cred computation with respect to time. Rather than distributing grain on a weekly basis, we could distribute it daily or hourly.

Specifically: The amount of work spent computing stationary distributions (which is expensive) used to be linear in |V| · |T|, where |V| is the number of nodes in the graph and |T| is the number of time periods. It’s now linear in |V| + |V_scoring| · |T|, where |V_scoring| is the number of scoring nodes in the graph, which is much smaller than |V|.
CredRank enables “employment cred”, wherein a company may agree to pay a contributor to work on a project in exchange for a share of the contributor’s cred in the project earned during the period of employment. It’s still the case that cred should not be “for sale” on a market; an employment cred relationship represents an agreement at the time that the cred was minted. CredRank provides a natural way to do this safely, by simply creating multiple out-edges from epoch nodes.

Several long-time SourceCred contributors have been sponsored to work on SourceCred and would like that sponsorship to be fairly reflected.
The implementation of CredRank enables cleaning up a lot of legacy core APIs, implementations, and UIs now that we’ve found better abstractions, improving accessibility both for interested new users and for maintainers.

Implementation plan

Factor out any pre-work from the wchargin-markov-process-graphs branch (e.g., #1676).
Productionize and write tests for the MarkovProcessChain data type, plus constructors (from plugin graphs) and destructors (to raw Markov chains).
Connect the new pipeline to a command-line interface (load and/or scores).
Connect the new pipeline to a user-facing web UI, either adapting the legacy cred view, adapting the timeline cred view, or creating something new.
Test against a variety of projects for efficiency and qualitative results. It’s okay and expected that distributions should change. The new distributions should be “at least as reasonable” as the old ones.

Deliverables

An implementation of MarkovProcessGraph that’s up to SourceCred quality standards.
A CredRank view integrated into the live web UI.

Dependencies

None yet.

References

Lessons from the initial cred algorithm and the timeline cred algorithm were important in identifying problems and shared vocabulary.
The “cli2” scaffolding from the new instance system was a big help in providing a low-friction integration point for the new types and analysis.

Contributions

@wchargin and @decentralion jointly developed the ideas and initial prototype for the algorithm at CredCon 2020 (drawing heavily on prior experience and insights).
A proof of concept implementation is available on the wchargin-markov-process-graph branch on GitHub.

wkarshat · March 6, 2020, 6:13am

Sounds smart and compelling, all Pro. Any Con at all? Like this coder’s surprised that adding nodes results in a much smaller output…

wchargin · March 6, 2020, 7:31am

Sure, there are always trade-offs. One con is that the explicit transition probabilities in the Markov process graph require some kind of “hierarchy” of weighting normalizations if we want to make them composable. Currently, the hierarchy only has two levels with four parameters (α is the seed probability, β is the epoch owner probability, γ₁ and γ₂ are epoch webbings), so we don’t have to reify it, but I can anticipate this becoming a source of friction.

In general, this formulation feels cleaner because we’re taking a proliferation of ad hoc APIs and consolidating them now that we understand the structure better, so it’s not surprising that it could be better than the current solution on multiple axes.

The size reduction comes from only needing to store one copy of the graph. You can think of it kind of like a normalization in the database sense.

Numbers are estimates, approximate, and from memory. The space usage one in particular I think was just a prediction by @decentralion and me rather, so I wouldn’t be surprised if I indeed have it wrong. (The runtime improvement is actually measured.) I’m not focusing on the precise figures; that’s not what we’re most excited about. The main benefits, as described, are the computation time and the flexibility. The computation time is significantly improved because we only compute stationary distributions once (on a graph that’s only moderately larger) rather than once per time interval.

s_ben · March 23, 2020, 5:24pm

Does there now exist a time-based attack vector, where a contributor could go back and boost earlier contributions, such that later contributions get more cred?

wchargin · April 20, 2020, 6:55am

I haven’t been actively working on this for a bit, but intend to pick it back up soon.

wchargin · June 8, 2020, 4:04am

(Sorry, just saw this comment!) At the risk of hazarding a guess: it seems unlikely that this would be effective across long time scales, because webbing path weights decay exponentially with the time difference (γ₁^(Δt/epoch period)). We’re definitely going to look at the effect of varying these parameters, so I expect that we’ll be able to play around with your interesting question a bit.

wchargin · November 2, 2020, 10:29pm

I’ve added @decentralion as co-maintainer here, since they’ve been driving high-quality progress on this initiative lately.

Topic		Replies	Views
A Gentle Introduction to Cred Research	10	3447	December 29, 2019
SourceCred Protocol The CredSperiment 📦artifact	5	2517	January 11, 2020
Timeline Cred Prototype The CredSperiment	4	1284	September 12, 2019
A primer on the SourceCred algorithm Research	0	913	January 25, 2020
Visualizing the SourceCred graph Initiatives Wish List	5	1956	February 16, 2020