Preliminary CredSperiment Cred

Greetings, credizens! I am very pleased to share a rough draft of SourceCred’s own cred, for the CredSperiment.

In contrast to the instances I shared in the CredSperiment Progress Report, we now have a combined instance which properly resolves identities across GitHub and Discourse.

You can check out the instance here. For posterity: if the URL is down, you can also serve the docs subdirectory of this commit with any HTTP server.

My analysis of the scores

Here are the scores from the new instance:

On the whole, I think the scores are pretty reasonable; at least as an ordering, they correspond reasonably well to my intuitions about who has been contributing to the project. (Although some very important contributors, like Juan Benet, are missing entirely, because they contributed offline.)

Weight Tweaks

It’s not entirely a happy accident that the scores reflect my intuition… I tweaked the weights!

Mostly, the weight tweaking consisted of pushing down the GitHub weights and increasing the Discourse weights, as I felt that Discourse contributions were undervalued with the default weights. You can click “show weight configuration” in the top-right to see the weights I used, and you can change them and recompute if you want. Feel free to suggest different weights on this thread!


(If you want to see which weights I changed from their default values, you can take a look at the weights.json file.)

Scores for different contribution types

A deeper way to explore how SourceCred is doing is to look at the scores not just for the users, but for the contributions themselves. Ideally, when looking at the top contributions within a category, you’ll immediately say “yeah, that stuff was all really important!”. If SourceCred is struggling, it will seem more like a random selection.

GitHub issues

Starting with the positive, I think it actually does a pretty good job of identifying important issues on GitHub:

Those top issues all correspond to really important features in SourceCred. The scores are working well because I tend to create a “tracking issue” for a broad area of work, and then reference that tracking issue from each pull request that works on it. SourceCred detects the reference edges, and thus flows cred to important tracking issues.
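To make the mechanism concrete, here’s a minimal sketch of how reference edges might be represented in the contribution graph. The node IDs and edge kinds are hypothetical, not SourceCred’s actual data model:

```typescript
// Hypothetical shapes, for illustration only.
type NodeId = string;

interface Edge {
  src: NodeId;
  dst: NodeId;
  kind: "REFERENCES" | "AUTHORS";
}

// Two pull requests each reference the same tracking issue:
const edges: Edge[] = [
  {src: "pull-A", dst: "tracking-issue", kind: "REFERENCES"},
  {src: "pull-B", dst: "tracking-issue", kind: "REFERENCES"},
];

// During the PageRank computation, part of each PR's cred flows along
// its REFERENCES edge, so heavily-referenced tracking issues end up
// with high scores.
```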

Discourse Topics

However, it struggles a bit more with identifying important Discourse posts:

Many of these top topics (including the top 3) are by users who have very little engagement other than posting a topic; they have very few “out-edges”, and as such their cred gets stuck in a self-referential loop. I discuss this in more depth in Sneak peek: SourceCred Discourse Plugin. I also have a plan for how to quantitatively detect when this is happening, and a fix for the bug; more to come later.

That said, the Discourse plugin is still missing a vital feature: reference detection. The important posts and topics around here tend to get referenced a lot; we don’t track that yet, but it’s pretty easy to add. I expect this will improve the cred quality a lot.

GitHub Pull Requests

SourceCred does even worse with identifying important pull requests:

Honestly, this might be no better than random. The issue is that for most pull requests, SourceCred has very little information about how much each one mattered; we’re not yet looking at which files a PR touched, or module dependencies, or even the number of lines of code changed! There are likes, true, but many important pull requests go unnoticed and unliked.

We’ll need to get better at this. Personally, I would really like to create a way for trusted contributors to directly opine on how valuable different pulls are. That could be through the “boost” mechanic, or as an even faster fix, @wchargin and I could assign manual weights to every pull request.

Scores by Domain

As an added dimension of analysis, we can take a look at the GitHub and Discourse sides of cred separately.

GitHub

In my opinion, the GitHub cred shows clear limitations of assigning cred based on activity levels: after William stopped focusing on SourceCred full time last November, the total level of activity on GitHub dropped dramatically, but the rate of value creation did not drop nearly that fast. However, since I was doing most feature work on my own rather than with William, there were fewer comments, fewer reviews, etc, which means less cred in the current system.

Discourse

On the Discourse side, it looks pretty reasonable overall:

Though it stands out to me that @nayafia has one of the highest scores, despite only having two posts. As I mentioned above, this is because PageRank goes a little crazy for people who have essentially no out-edges, and I have a plan for how to fix this.

Takeaways

As we’ve explored, these scores are imperfect. But: I think they’re also the best scores that SourceCred has yet produced! Integrating the Discourse plugin lets us recognize a host of really important contributors who were going unseen when we only used GitHub data.

The key question now is: are these scores good enough for us to inaugurate Phase 1 of the CredSperiment, and start paying based on the topline scores? In my opinion, the answer is “yes”, so I’m planning to go ahead with calculating payouts in time for the first week of October.

Please post your thoughts, your concerns, and your alternative weight configurations!

5 Likes

Short, crude response because I’m on the train.

Plus, looking at PR #1288, which I have some context on, I think it shows a sensitivity to convention, or the breaking of it.

For #1288, a lot of the cred I assume comes from me and Vanessa commenting on it like crazy in a very conversational troubleshooting way. That kind of conversation may be pretty normal, but in other cases it may occur on Discord, for example, and generate not nearly as much cred. I wonder where this PR would land if all conversation comments were nerfed. Not to downplay the value of the PR though; it was several days of work, and Docker is now the primary way I use SourceCred. Just suggesting this might have inflated the PR over other ones with similar value.

Likewise, I think cred currently represents a lot of contributor interaction rather than value derived from the software. Imagine feeding Docker download numbers into this PR (and others that touch on the Docker build pipeline), contrasted with the future npm module. I think it could be an interesting factor to measure the value of having published there.

1 Like

Same thing about convention when looking at what’s happening right now at https://github.com/sourcecred/sourcecred/pull/1290#issuecomment-533066244

This will attribute cred to the PR and related commits, while what’s happening is a broader discussion. I just don’t want to be “that guy” who goes “you’re off topic, take this out of the PR” and closes it.

It’s a break from convention and will need moderation to go back and fix from an attribution standpoint.

@Beanow you bring up some very good points. Conceptually, I think we’re finding that “amount of activity on a PR” is a very bad proxy for “value of the PR”, since lots of activity is more likely to correspond to conversational debugging, controversy, or perhaps bikeshedding–none of which are robust indicators of value.

A better approach may be to have the value of the PR be independent of the activity, but have the cred from the PR flow out based on the activity. So a PR that merges without any fuss gives most of the cred to the author; a PR that involves a long conversation prior to merge splits the cred between the participants. Of course this would introduce its own issues (e.g. people incentivized to bikeshed so they can take cred from a PR), but for now I think it would be a better heuristic.

We can implement this by making a change to the weights; most directly, by changing the “has child” weight to 0. That means PRs (or issues) will no longer get cred from their comments. If the “has parent” weight is above zero, then PRs (or issues) still give cred to their comments; they just don’t get any back.
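As a sketch, the tweak might look something like the following. The weight identifiers and config shape are illustrative; see the actual weights.json for the real format:

```typescript
// Illustrative only; real identifiers and formats may differ.
const edgeWeights: Record<string, number> = {
  // "has child": how much cred a PR/issue receives from its comments.
  // Zero means heavy comment activity no longer inflates the parent.
  "github/hasChild": 0,
  // "has parent": how much cred a comment receives from its PR/issue.
  // Above zero, comments still get cred from a valuable parent.
  "github/hasParent": 1,
};
```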

If we do so, then the docker PR falls off the top of the cred chart:

I do think the docker PR should get plenty of cred, but the reason is that it was very important for some end users and made it easier to use SourceCred, not because it involved a lot of activity on GitHub. :slight_smile: I think the best approach will be to come up with a simple system for categorizing PRs by kind (“feature-work”, “bugfix”, “refactor”, “documentation”, “build”, “misc”) and perhaps by difficulty (“trivial”, “easy”, “medium”, “hard”). Then we can categorize the PRs manually, and assign weights to each of these types.
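As a sketch, the categorization could translate into node weights like this; the kinds and difficulties are from the paragraph above, while the multiplier values are made up for illustration:

```typescript
type Kind =
  | "feature-work" | "bugfix" | "refactor"
  | "documentation" | "build" | "misc";
type Difficulty = "trivial" | "easy" | "medium" | "hard";

// Hypothetical multipliers; real values would be chosen by the community.
const kindWeight: Record<Kind, number> = {
  "feature-work": 4, bugfix: 4, refactor: 2,
  documentation: 2, build: 2, misc: 1,
};
const difficultyWeight: Record<Difficulty, number> = {
  trivial: 0.5, easy: 1, medium: 2, hard: 4,
};

// A PR's manual node weight is the product of its two labels:
function prWeight(kind: Kind, difficulty: Difficulty): number {
  return kindWeight[kind] * difficultyWeight[difficulty];
}
```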

Here’s how the cred looks with the edge weight change applied:

@vsoch still rates quite highly; in part this is because, while I changed the edge weights, we still have a pretty high node weight on GitHub comments. Arguably that weight should be close to zero, since we don’t value GitHub comments in and of themselves; we mostly just value the changes to code that they participate in. Here are the scores with the GitHub comment weight set to zero.

Just looking at the relative ordering, I think this might be an improvement over the scores in the first post. Not to downplay @vsoch’s contributions, but both @mzargham and @s_ben have been considerably more involved in the project, so it makes sense to me that they would be ranked higher.

I think this is a good example of where SC can piggyback on common practices that are good proxies for importance. Another which comes to mind is the “thank you” email you see in companies following a big launch or announcement. Those emails are generally very information-rich, and largely about publicly giving credit. Not an applicable example here, but I’ve been thinking about that one and just wanted to drop it somewhere.

In a similar vein, what about mentions (e.g. @decentralion)? Those seem to convey some cred as well, if less than post references.

Side note: when I posted the above, I got a Discourse notification warning me that I was double posting the link to the boost mechanic and reference to wchargin…

I agree human input/curation could be valuable here. I do think that once we start playing the game, with money on the line, if people know :+1: = :money_with_wings:, that will likely affect their behavior, causing them to use more likes/emojis/etc. Just like we’ve noted we’re already altering our Discourse behavior. That actually could be a good mechanism for getting input, even though it relies on a little behavioral change. The boost mechanic could be good too… as a contributor who has some mana lying around, I could see spending a little to boost a post…

In my internally imagined version of SC that is workable today, IMO, the regular distribution of money/mana according to cred would basically fix this. When you and William are both contributing, the activity (and therefore cred) is higher, which means more mana to split between you two (roughly evenly, apparently). Then when it’s just you contributing, there’s less cred generated (and therefore less mana), but you’re getting all of the mana. So the distribution is about right. And solving that piece alone is massive. Also, with a boosting mechanism, you could inject that mana into the graph later when more contributors arrive, giving yourself more cred or mana (your choice).

Looking good! Let’s go!

I think that the scores are looking pretty damn good. Maybe they need a lot of manual tweaking right now, but having a small group of contributors interacting with them will hopefully provide useful feedback that can drive design decisions/experiments. Re: the CredSperiment Phases, looks reasonable, though I wonder if building the infrastructure is necessary before we start experimenting with mana and boosting. In a production system/DAO, the infrastructure would be necessary. But while we’re experimenting, and the stakes are relatively low, I’d be fine with everything just being on a spreadsheet with manual payments. That’s just me though :airplane: :money_with_wings: :upside_down_face:

1 Like

I think it’s easy to be fine with a shoddily implemented system when you won’t be the one responsible for keeping it up and running :wink:. I worry about driving myself crazy keeping some complicated ad-hoc spreadsheet up to date, which is why I want at least basic infrastructure for doing so, and for keeping the history in git.

One of the questions for prioritization will be: to what extent do we want to speed ahead to future stages of the CredSperiment, and to what extent do we want to improve the quality of the software so far? E.g. make it dramatically easier to set up and manage a SourceCred instance, improve the UI so it does a better job of showing where cred came from, etc. I’m planning to write a roadmap document in the coming week which goes in depth on these tradeoffs, and will be happy for reviews and feedback at that time.

1 Like

Most critical for real-money scenarios, IMO, is finding the requirements for, and designing, a ledger. The most urgent quality improvements may be derived from that to an extent.

A place to look for ideas is “event sourcing”. It’s a pattern where your source of truth is not mutable tables (the typical DB design) but an immutable array of events. It’s particularly well suited for situations where you need auditability. And using it should feel familiar if you have used, for example, Redux to reduce events to an app state.
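A minimal sketch of the pattern, with hypothetical event shapes for a payout ledger; the derived balances are never the source of truth, only the event log is:

```typescript
// An append-only event log is the source of truth.
type LedgerEvent =
  | {type: "PAYOUT"; week: string; contributor: string; cents: number}
  | {type: "TRANSFER"; from: string; to: string; cents: number};

// Balances are derived by reducing over the log, as in Redux.
function balances(events: LedgerEvent[]): Map<string, number> {
  const out = new Map<string, number>();
  const credit = (who: string, cents: number) =>
    out.set(who, (out.get(who) ?? 0) + cents);
  for (const e of events) {
    if (e.type === "PAYOUT") {
      credit(e.contributor, e.cents);
    } else {
      credit(e.from, -e.cents);
      credit(e.to, e.cents);
    }
  }
  return out;
}

// Auditing a balance = replaying the immutable log from the start.
```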

1 Like

My plan for the CredSperiment v1 is to write a very simple ledger with the following properties:

  • auditability, the ledger contains an immutable record of every payout and transaction (history stored as JSON files in git)
  • integer balances (store everything as integer # of US cents, with rounding down when computing the payouts to individuals)
  • part of the main sourcecred/sourcecred repository (for convenient integration with the types, apis, and build system, might split it out later)
  • a very simple UI addon to SourceCred which displays every contributor’s total balance, and maybe how much they received each week

I’ve done some experiments with writing JS ledgers before (‘cryptofolio’), which used BigNums for arbitrary precision; that’s important for tracking cryptocurrencies, but not really needed if we can just treat US cents as the smallest unit of account.
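As a sketch of the integer-cents idea (names are hypothetical, not an actual SourceCred API): split a weekly budget proportionally to cred, rounding each share down so fractional cents never exist:

```typescript
function computePayouts(
  budgetCents: number,
  credByContributor: Map<string, number>,
): Map<string, number> {
  let totalCred = 0;
  for (const cred of credByContributor.values()) totalCred += cred;
  const payouts = new Map<string, number>();
  for (const [who, cred] of credByContributor) {
    // Math.floor guarantees an integer number of cents per person.
    payouts.set(who, Math.floor((budgetCents * cred) / totalCred));
  }
  return payouts; // cents lost to rounding stay in the budget
}
```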

1 Like

Reproduction or explanation is important too. It’s one thing to show that in week N we awarded person Y amount X. But did you just pull that number out of your @&#% or what? Did you use the agreed-upon weights? Which graph did you use? So maybe add snapshots of the data? Precision is a good point, and even integers tend to need extra work in JS. The number of times I’ve seen an “integer” 43 jump to 42.99999999 is crazy.
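For anyone who hasn’t been bitten by this: JS numbers are IEEE-754 doubles, so drift appears as soon as decimal fractions enter a computation:

```typescript
console.log(0.1 + 0.2);         // 0.30000000000000004, not 0.3
console.log(0.1 + 0.2 === 0.3); // false
// Storing balances as integer cents and sticking to +, -, * and
// Math.floor keeps every intermediate value exactly representable.
```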

Anyway, the lesson from my digging into event sourcing is that modeling the events correctly is tricky but absolutely worth spending time on. Tools to display this data are, IMO, a good case for dogfooding: we just need to use the same tools for building the ledger, validating it, and managing payouts.

3 Likes

Going back to the quality of the scores: I think we should take a little step back from technicalities and look at the conceptual level.

The big ways of valuing work that I know of are:

  • Intrinsic value of :muscle: honest work
  • End-user utility :chart_with_upward_trend:

The first is especially important for invisible work, like security patches, where the ideal outcome is that nobody is ever affected by the vulnerability, and most of the work is done off the record. Your average user wouldn’t be able to rate its value.

The second is critical for the “reevaluating” property of SourceCred. Consider the same bugfix applied to a feature nobody used before it was removed, versus to the core of the software used by all end users over the years.

Also, I think there is a dependency/catalyst relationship: “International adoption would not have been possible without translations”, or “Developer adoption would not have been possible without documentation work”.

These I feel should be present in the algorithm.

2 Likes

This would really be a special circle of hell lol

Good call :+1:

Maybe we should find a way to incentivize community feedback so that as people play the game we get a constant stream of data/feedback to tune the game itself?

Ditto!

I’m fine with anything that gives us more data and experience to improve the model. The quicker we can make the model better, the quicker we can get more people to play, and the more people play and in more contexts the more data we’ll have, the better the game will be, etc…

Yeah that too

I’ve been contributing to discussions because they’re awesome and I’m curious about this project. Now that the system is going live I’m more curious. I actually want to run it and take it apart and figure out how it works (maybe even contribute!). This is driven by the fact that I’m moving from being an observer/commenter to an actual participant. Now it’s real. I have skin in the game and an incentive to participate more! :slight_smile:

Yes please

Agreed, but how would we measure that? Is it possible to really know how much translations or an improved UI affected users, versus a marketing campaign that launched at the same time? Often a team measures this subjectively, but could we really quantify it to be directly integrated into the SourceCred algorithm?

I would like to suggest one bit of nuance to this. I feel like the value of a PR is:

  • The value of the commits (accepted or not).
  • Having gone through the collaborative process to improve its quality and support base within the community.

Hence I feel like a “no fuss” PR being valued the most does not incentivize the collaboration side of things. PR reviews in particular, I feel, are valuable, but I do think there is a point of diminishing returns: 10 reviews do not add much over 3 reviews, and 100 comments don’t add much over 10 succinct ones.

So I wouldn’t encourage a system of “each PR is 5 cred, shared between all interactions”. I would suggest the first 3 reviews and 10 comments are additive; from there on, it’s split between interactions, or some kind of diminishing-returns curve for their value. Incentive-wise, I feel like this creates space for dedicated/regular reviewing roles. Of course, ideally this would be something you can weight to suit your community.
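A sketch of that heuristic, assuming the first few interactions count fully and later ones decay geometrically (the thresholds are from the paragraph above; the decay rate is made up):

```typescript
// Weight of the n-th review or comment on a PR.
function interactionWeight(
  index: number,    // 0-based position of this interaction
  additive: number, // e.g. 3 for reviews, 10 for comments
  decay = 0.5,      // each later interaction is worth half the previous
): number {
  if (index < additive) return 1;
  return Math.pow(decay, index - additive + 1);
}

// With additive = 3: ten reviews total about 4.0 weight, barely more
// than the 3.0 that three reviews earn (diminishing returns, as suggested).
```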


I get where this is coming from, but I foresee a number of issues with relying on this too much to get to “pretty good” scores. One is the obvious politics/governance of having to make these judgement calls about what is easy or hard. Launchpad, for example, has a “bug heat” score to complement a traditional category assigned by someone with triage permissions, to make sure not all of this falls to people with authority. Another is that it’s hard for this to be an exhaustive list, and thus it’s not great for unexpected contributions. Finally, it comes back down to having to judge the value of these things either ahead of time or around the time of merging.

But I do like some aspects of this. A minimal set of “coloring” nodes, so they can be weighted differently. I mentioned security before; this is difficult to quantify both with metrics and with judgement calls, as it requires expertise. A minimal coloring of a PR as “security” lets you boost it if you like. For most contributions, though, I would rather look for a “reevaluating” metric instead of classifying them, such as the number of commits for which the code remains unchanged, or usage statistics of the features it touches on. This coloring doesn’t need to happen only manually on GitHub, either. It could also mark portions of the code as being “core” and automatically color PRs and their interactions from that; for example, to boost reviews for PRs against the core.

Bug heat

Launchpad helps you to appraise a bug by giving you a calculated measure — called bug heat — of its likely significance. You can see bug heat in bug listings, and also on individual bug pages, as a number next to a flame icon.

Here’s how Launchpad calculates the bug heat score:

| Attribute | Calculation |
| --- | --- |
| Private | Adds 150 points |
| Security issue | Adds 250 points |
| Duplicates | Adds 6 points per duplicate bug |
| Affected users | Adds 4 points per affected user |
| Subscribers (incl. subscribers to duplicates) | Adds 2 points per subscriber |

You’re completely right. And of course this is why proxies are being tested: GitHub data is objective and not difficult to obtain, while accurate tracking of usage and other such stats is a lot more difficult. I’m suggesting it more as: these are the lines of thinking I would like to see represented in those proxies, if we can’t represent them directly.

One key point, for example, is that I feel “pure meritocracy” does not fully capture the “honest work” part in some situations. And honest work is subjective in evaluation: is it how your skills rate in the market? Is it final impact? Something else?

Also, tweaking values until they “seem right” may be required to some extent (the cred historian role being an example), but for the development of SourceCred I feel like they should be explained in terms of what they’re a proxy for.

This is true, especially if you take into consideration all the thought and experimentation that was invested before an action was taken. Hard to measure time spent offline thinking, reading, talking, writing, editing, and testing things out. Kind of like how it’s a lot harder to write something short and good than something long and rambling.

1 Like

Yeah, we’ll do them too. From an implementation standpoint they’re nearly identical. We’ll keep the edge weights separate though, so that we can give them distinct weights.

This assumes there’s some kind of fixed relationship between the rate of cred generation and the rate of mana generation. Currently, we assume a semi-fixed relationship between mana and $ entering/exiting the system (so 1 mana = $0.01), but cred is free-floating, which implies that there is no fixed relationship between cred and mana. That said, most of the mana stuff is still pretty wide open. Having a fixed cred-mana relationship could be really nice, but having a hard dollar value for mana is nice too.

In economics, the Impossible trinity states that it’s impossible to have the following 3 properties at the same time:

  • a fixed foreign exchange rate
  • free flows of capital
  • an independent monetary policy (sovereign control over interest rates)

I think we’re going to find a similar result for valuing cred/mana. Will be interesting to discuss what tradeoffs we want.

What $\alpha$ parameter are you using? Generally, a higher teleportation probability will help resolve this, since it is technically equivalent to adding outbound edges from every node to the seed vector.

2 Likes

Too tired to absorb more economics right now, but just wanted to say that one question central to my thinking about this is deciding which variables to set via free markets (e.g. the foreign exchange rate?).

Makes me think of this study recently that looked at private Bittorrent sites as markets. https://medium.com/@jbackus/private-bittorrent-trackers-are-markets-1d3cc3c9bacd

The bullet point findings jumped out at me as dynamics we could see in SC:

  • Upload/download ratio requirements are a currency system in disguise. Users are just trying to maintain a positive balance
  • Private trackers extend loans to new users to give them time to maintain a good ratio
  • Users with slow internet connections have lower “earning potential” and they work more hours to match that
  • Central ratio requirements price every file equally, distorting the market and suppressing supply and demand
  • Status and altruism motivate excess earning

The authors of course critique it and say a free-floating market price would better determine some things, but I think it’s exciting just to see a working example. And one that is also a decentralized protocol with pseudonymous participants.

1 Like

It’s set to 0.05. I just merged a PR that makes this visible and changeable in the weights configuration. I think I’ll switch the default alpha to 0.20, which decreases sensitivity to these issues and (IMO) improves the cred quality.
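For anyone curious where alpha enters the computation, here’s a minimal personalized-PageRank sketch (not SourceCred’s actual implementation). With probability alpha, a node’s score is redistributed via the seed vector instead of following its out-edges, which is why a higher alpha keeps cred from pooling in self-referential loops:

```typescript
function pagerank(
  adj: number[][], // adj[i][j] = weight of the edge i -> j
  seed: number[],  // teleportation distribution; entries sum to 1
  alpha: number,   // e.g. 0.05 today, 0.20 proposed
  iterations = 100,
): number[] {
  const n = seed.length;
  let score = seed.slice();
  for (let it = 0; it < iterations; it++) {
    const next = new Array(n).fill(0);
    for (let i = 0; i < n; i++) {
      const outDegree = adj[i].reduce((a, b) => a + b, 0);
      if (outDegree === 0) {
        // Node with no out-edges: all of its mass teleports.
        for (let j = 0; j < n; j++) next[j] += score[i] * seed[j];
      } else {
        for (let j = 0; j < n; j++) {
          next[j] += (1 - alpha) * score[i] * (adj[i][j] / outDegree);
          next[j] += alpha * score[i] * seed[j];
        }
      }
    }
    score = next;
  }
  return score;
}
```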

1 Like

I love this article! It’s one of my favorite things I’ve ever read because it got me thinking about mechanism/incentive design in a more abstract and general way :slight_smile:

If you really think about it, it’s just the exchange of energy, where energy = time, effort, and/or money. So in any system where you’re exchanging value, economics and game theory still apply; money is just the most well-studied and measurable version of this. With token models, however, you’re measuring and modeling reputation, governance/power, financial value, access to goods/services, etc… Much more interesting and much more complex!

Are there any other resources you’d recommend for exploring modeling and designing incentive mechanisms, especially in a more social design space?

So @mzargham’s cadCAD system, which he’s using to simulate game-theory dynamics (including for SourceCred), is generally based on the principle of energy conservation: how much energy different actions require, and how it flows around the system. I’m not doing it justice; it could be worth checking out the above talk if you haven’t.

So many other examples, I’m sure… but the BitTorrent paper is the only one I had an aha! moment with. Surely there are examples with video games (which I don’t play), and @decentralion is a big fan of MMOs. So we may end up working in one :)