This is going to be a primer on understanding the SourceCred algorithm and where it lives in the codebase. This is going to be a messy, non-canonical document. I’m writing this primarily to share with @vsoch, but posting it publicly in case it’s helpful for others.
The first thing to know about SourceCred is that we’re oriented around a core data structure: the graph. Although technically, it’s a quiver as the edges are directed, and we allow loop edges and multiple edges between the same pair of nodes.
Every node in the graph will potentially accumulate cred. This means that both contributions (like this post) and contributors (like me) are nodes in the graph. Every edge is directed, (it points from a
src to a
dst), but in general, we flow cred in both directions on each edge. For a deeper look at how we think about edge directionality, read this post.
SourceCred assigns scores based on weighted graphs; that is, graphs where every node has a weight, and every edge has a (forward, backwards) weight. We assign cred scores to the nodes in the graph based on a modified PageRank algorithm; please be familiar with the PageRank whitepaper.
- We convert the graph into a markov chain, using the edge weights to determine the transition probabilities between nodes.
- We use the node weights to set the seed vector for PageRank; thus, you can think of each node as “minting” fresh cred in proportion to its weight
- This gives us a probability distribution over the nodes. However, we want cred scores, not raw probabilities. To fix this, we define a certain set of nodes as “scoring nodes” (i.e. all the user nodes). We then normalize scores so that the sum of all scoring nodes’ scores is equal to the target amount of cred, which by convention is the sum of all the node weights.
The above is the core idea of how we calculate cred. However, it’s actually a bit more complicated, because we want time-varying scores; i.e. we want to calculate the cred for each node (contributor or contribution) in every individual week of the project’s history.
We do that via the (reasonably well documented) timeline pagerank module. The basic idea is:
- We start with a weighted Graph that describes the whole history of the project
- We partition the history of the project into week-long intervals
- For each interval, we create an interval-specific graph where:
- Every “new” node or edge (created in most recent interval) has full weight
- Every “old” node or edge has exponentially decayed weight
- Every “future” node or edge has 0 weight
- Then, we calculate cred for each interval-specific graph, giving us an interval score
- A node’s total score is the sum of its interval score for all intervals
I hope this is a useful start – please ask questions and I’ll continue to elaborate. Also, I’m currently working on a refactor to make the WeightedGraph an explicit data type, which should make the code less convoluted and easier to follow.