Maintainers User Stories discussion

Building off the office hours discussion (notes here: Office Hours Agenda 5/16) and the Discord discussion with @decentralion and @s_ben, this thread is to kick off ongoing discussion and refinement of our understanding of the needs of maintainers.

After talking to a maintainer of a project interested in implementing SourceCred, and discussing other dev metrics generally, the main requested feature seems to be the ability to manually adjust the weighting of a contribution. There are inevitably some contributions that are simply badly weighted by the current algorithm. E.g. someone who adds a bunch of lines of documentation receives the same credit as someone who writes a ton of code. Someone who deletes a bunch of old deprecated code that wasn't being used anyway receives a bunch of cred. Long, drama-filled, or simply unhelpful comments could be rewarded, etc. Cred seems to work "pretty good" out of the box, which is a great accomplishment. But getting to the point of "wow, this realistically reflects the reality of the project" requires this feature, I think, at least until more robust heuristics come online in the future (which could be a long time).

Another requested feature is time-based cred. In some projects, cred will be dominated by, for instance, people who are no longer with the project. Or a maintainer or community may want to somehow focus on recent contributions, or new contributors.

Having written executive-level analytics dashboards, front-line manager performance metrics tied to pay, and other operational metrics, I've learned that you can't fix cultural problems by tweaking the metrics.

By cultural problems, I mean how people write code, how people interact via messaging services, and how people write documentation. Itemizing cred by contribution may temporarily solve a calculation problem in the short term, but it won't change a contributor's way of contributing. I get why someone might request this (and perhaps I'm misunderstanding your comment), but if the examples you listed are accurate, that sounds more like an issue with the project/entity's culture than with SourceCred or any other attribution framework.

Put another way, what role should SourceCred play in "normalizing culture" of a project or organization via weighting contributions or contributors? I'm not sure I have the answer here, but I've not seen a situation where any kind of attribution/performance measure fixed a bad apple. I have seen staff file grievances against management (including union-backed filings) over unequally exercised standards. For the record, it involved my work, AND the union employees were definitely right in my opinion. Interested to hear others' experiences here though!

Definitely this - as an option. I can see wanting the whole project weighted equally for some things and more current contributions weighted more heavily for other things.

@mzargham - feeling bad I missed this meeting - very interesting topic!

Your experience is very valuable in this context. Glad to see this answer. When I proposed using more traditional dev metrics (not tied to pay) in a project I currently contribute to, I got a very negative reaction from some of the devs. Presumably because they have had negative experiences.

I get why someone might request this (and perhaps I'm misunderstanding your comment), but if the examples you listed are accurate, that sounds more like an issue with the project/entity's culture than with SourceCred or any other attribution framework.

Some of my imagined uses I don't think are cultural. For instance, weighting documentation in source files (which I've written a lot of) differently than lines of code will, I think, be common across cultures. Presumably most (if not all) projects will value different categories of work differently (however they value them; they'll have a knob to tweak that). But at some level, yes, this will be used to 'enforce' or 'encourage' certain cultural values. I would argue that goes on in the course of any organization already.

Put another way, what role should SourceCred play in "normalizing culture" of a project or organization via weighting contributions or contributors? I'm not sure I have the answer here, but I've not seen a situation where any kind of attribution/performance measure fixed a bad apple. I have seen staff file grievances against management (including union-backed filings) over unequally exercised standards. For the record, it involved my work, AND the union employees were definitely right in my opinion. Interested to hear others' experiences here though!

So dev metrics can and have gone horribly wrong in the past. I think it's naive to think SC won't face some of the same issues. "Normalizing culture" I think is an issue for all widely adopted technologies (e.g. making English, the language we're using, more dominant, not to start that debate). In regards to unions, grievances, etc., I think SC will see that. I don't think that's bad, per se. It's just going to require governance to come into it. I personally think no crypto project is going to sidestep governance anyway. As for whether this is fair, I think a tool that is actually able to measure contributions could be a tool for more fairness, for all involved. One idea here would be to introduce governance/voting into the weighting of cred. For instance, you can only nudge a weight in proportion to the cred you have earned in that project.
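To make that concrete, here's a toy sketch of what cred-proportional weight nudging could look like (none of this exists in SourceCred; the function names and the clamping rule are just for illustration):

```python
# Hypothetical sketch: cap how far a contributor can nudge a node's weight
# based on their share of the project's total cred. Not part of SourceCred;
# all names and the clamping rule are made up for illustration.

def max_nudge(user_cred, total_cred, max_step=1.0):
    """Largest weight adjustment this user may apply, proportional to cred share."""
    if total_cred <= 0:
        return 0.0
    return max_step * (user_cred / total_cred)

def apply_nudge(current_weight, requested_delta, user_cred, total_cred):
    """Clamp the requested weight change to the user's allowed nudge."""
    limit = max_nudge(user_cred, total_cred)
    delta = max(-limit, min(limit, requested_delta))
    return max(0.0, current_weight + delta)

# Example: a contributor holding 5% of the cred asks to raise a weight from
# 1.0 to 2.0, but is only allowed to move it by 0.05.
print(apply_nudge(1.0, +1.0, user_cred=50, total_cred=1000))  # -> 1.05
```

The exact rule (linear in cred share, clamped per request) is just one possibility; the point is that influence over weights scales with earned cred.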


This is one of the main focuses of ongoing research. It's nuanced, so there is not likely to be a "tada, here is the answer" in some global sense, but we're using numerical experiments to explore the way certain time-dependent rules can result in different characteristics for the metric over time:

The thing that makes this research interesting and challenging is that you are computing a metric over a time-varying graph; graphs are sufficiently complex objects that even continuous metrics change in unintuitive ways as they evolve in time.

@ryanMorton @s_ben :
Here are some snaps from the report to demonstrate even the basic PageRank algorithm changing over time, given a relatively simple graph formation process.

The graph forms iteratively:

[Figure: Summary of graph growth]

[Figure: PageRank calculation for each user]

The graph contains both users and contributions by users, but the plots show only the users' cred. The seed for the graph formation was a subgraph of GitHub - sourcecred/research: Repository for research-related items on SourceCred's agenda, which is where the real user names come from. For more of a dive into the research, feel free to jump into the "Temporal Context" section, "Computational Experiments" subsection of the report.
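If you want to get a feel for the effect without the full report, here is a minimal networkx sketch (not the research code; the graph formation rule below is a toy and much simpler than the process in the report) that grows a small user/contribution graph step by step and recomputes PageRank at each step:

```python
# Toy sketch: watch PageRank scores drift as a user/contribution graph grows.
# The formation rule is invented for illustration; it is not the research model.
import random
import networkx as nx

random.seed(0)
users = ["alice", "bob", "carol"]
G = nx.DiGraph()
G.add_nodes_from(users, kind="user")

history = []  # per-step PageRank of the user nodes
for step in range(20):
    author = random.choice(users)
    contrib = f"contrib_{step}"
    G.add_node(contrib, kind="contribution")
    # authorship edges in both directions so score can flow user <-> contribution
    G.add_edge(author, contrib)
    G.add_edge(contrib, author)
    # occasionally reference an earlier contribution
    earlier = [n for n, d in G.nodes(data=True)
               if d["kind"] == "contribution" and n != contrib]
    if earlier:
        G.add_edge(contrib, random.choice(earlier))
    scores = nx.pagerank(G, alpha=0.85)
    history.append({u: scores[u] for u in users})

# print the last few snapshots to see the user scores still shifting
for step, snapshot in enumerate(history[-3:], start=len(history) - 3):
    print(step, {u: round(s, 3) for u, s in snapshot.items()})
```

Even with a formation process this simple, the per-user scores keep moving as the graph grows, which is exactly the phenomenon the experiments in the report are probing.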

There will be contributed research on this topic for sure. At the moment, I'm working on collecting, digesting, and refining the work to date. I plan to write a Medium article on this topic based on the research to distill it further. I will post a draft on Discourse for feedback before it goes out.


I think this level of weighting types of contributions may already be envisioned. I think I read your earlier post as weighting a single contribution versus a whole type of contribution. In which case, I agree this is a good way to go!


I am arguing for the ability to weight single contributions as well, which I can see could be controversial or problematic. I see that as the quickest way to iterate to a realistic reflection of reality, which would be a valuable starting point, both in terms of showcasing SC and as a source of feedback data that could be used to train more generalized heuristics that avoid controversy. Weighting by type (e.g. recognizing documentation) I believe is planned.


Hot off the presses, since this discussion kicked off in Discord a few hours ago (thanks @s_ben!) I’ve implemented a prototype of manual node weights. Here’s a screenshot of how it looks (feedback welcome):

@ryanMorton, I'm curious to better understand what you mean here. As an example: I really like for pull requests to have a test plan, documentation on any new methods that are created, and nice commit messages. Suppose that I start assigning a weight to every pull request based on how well it meets those 3 criteria. I think this would change how people contribute to SourceCred. Do you agree?
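As a toy illustration of what such a rubric could look like (the base weight and bonus values here are invented for the example, not anything SourceCred implements):

```python
# Toy rubric: derive a pull request weight from three review criteria.
# The base weight and per-criterion bonus are arbitrary, chosen for illustration.
def pull_request_weight(has_test_plan: bool, documents_new_methods: bool,
                        has_clean_commit_messages: bool) -> float:
    criteria_met = sum([has_test_plan, documents_new_methods, has_clean_commit_messages])
    return 0.5 + 0.5 * criteria_met  # 0.5 if none met, up to 2.0 if all three

print(pull_request_weight(True, True, False))  # -> 1.5
```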

Aside: We could think of this as extending code review, so that rather than only being about approval to merge, it also includes a cred weighting. I like this idea, especially if we can find a way to engage the community so that it’s not only the maintainer who gets to make reviews.

(The nice thing about having the maintainer do it is that if we are willing to trust our maintainer, we don’t need the review mechanism to be robust to gaming. Trusting the maintainer is a reasonable short-term solution, but we’ll need more robust designs in the long term.)

This also gives me an idea for another interface for setting the weights. We could set it up so that if someone with maintainer permissions writes @credbot set weight 2 in a comment, then the parent issue/pull gets that weight. This would make the process of weight setting more legible to contributors because it happens on GitHub, probably along with associated explanation and justification, meaning it would also be more effective for shaping culture.
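A minimal sketch of the parsing side of that idea (the @credbot command is just the proposal above; nothing here exists yet, and a real bot would also need to check that the commenter has maintainer permissions before acting):

```python
# Sketch of parsing the proposed "@credbot set weight <value>" command from a
# comment body. @credbot is hypothetical; this only shows the parsing step.
import re

WEIGHT_COMMAND = re.compile(r"@credbot\s+set\s+weight\s+(\d+(?:\.\d+)?)", re.IGNORECASE)

def parse_weight_command(comment_body: str):
    """Return the requested weight as a float, or None if no command is present."""
    match = WEIGHT_COMMAND.search(comment_body)
    return float(match.group(1)) if match else None

print(parse_weight_command("Great work! @credbot set weight 2"))  # -> 2.0
print(parse_weight_command("No command here"))                    # -> None
```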

I’m with @s_ben on this one. This is a really high leverage addition to SC, in that it gives the tool a lot more flexibility right now, and will enable a much higher quality cred attribution for people who are willing to put in effort. We’ve always envisioned a rich “heuristics system” which would compose many different sources of signal about how valuable contributions are. The manual weighting basically lets the maintainer sub in for the heuristics.

I’m also still excited about the “Odyssey Plugin”, which will allow users to manually create new nodes and edges in the graph. This will enable recording offline contributions (currently impossible), and will be a much richer way of signal boosting a contribution. Adding weight to a node makes it more important, but doesn’t communicate why it’s more important. Connecting a node to a value or goal does communicate why it’s important, and enables further analysis like finding scoped cred within a particular goal or value.

That said, I think there may be a place for both in the long run. One big benefit of having manual weights is that it gives the maintainer the ability to stop cred spammers very effectively. If someone is an out-and-out attacker on your cred, you can just set the weight on their user node to 0, and thus ensure that they don't get rewarded. (There are more sophisticated attacks where setting the user weight to 0 wouldn't necessarily work, but this is still a great first-order defense to have in the toolkit.)


This is intriguing; I like it a lot. I think a power balance between maintainers and contributors is critical. Of course there's always the ultimate option to fork, but I think that should be an action of last resort, if maintainers and contributors really cannot coordinate effectively.

In order to really study how this might affect the balance, I need to spend some time thinking about how exactly to incorporate it into my computational experimental apparatus, but I am pretty confident there would be a good way to pull this into the strategic game framing of decisions.

For reference, here is a drawing from the report that was my first framing of the strategic variation of the users. Adding this in would be a 4th pink diamond and a second action profile for the contributors (rather than the maintainer, which already has the cred release decision and parameter selection decision).

They did the work, whether it met the criteria or not. Or is attribution/contribution also a question of quality and/or the maintainers' expectations? I think maintainers have existing mechanisms to ensure quality and practices that currently work.

The only other thought I have to share: I like the heuristics because it’s more of a policy that can be known, monitored, and verified systematically. Scoring contributions one by one is more personal/prone to conflict, more difficult to monitor, and I’m not sure what if anything could be verified.

Prototype looks great @decentralion! With this, I think I have the levers to create some customized instances that will be reflective of reality (my goal). I’ve got a maintainer of three active repos that is interested in trialing SC. Going to show this to them and hash out next steps. My goal is just to work with an instance or two until the cred weightings seem to reflect reality for all involved, hopefully to the degree it offers new insights, then share data/feedback here.

@mzargham glad to see how much thought you've put into the question of power balance between maintainers and contributors. Haven't had a chance to dive into your work, but this is emerging as a key focal point when I think about this stuff. Your chart seems to capture the more important elements at play here. As for your work on time-based cred, way out of my depth, but it sounds very interesting. If you have any code that can generate meaningful time-varying cred scores, I'd be interested to try it out. Otherwise, I'm probably going to just try running SC forward in time to get a rough measure, as @decentralion was discussing on another thread. In my head, a chart of cred per contributor over time is what makes the most sense, in terms of communicating the underlying reality in a way that makes intuitive sense.
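For what it's worth, once per-period scores exist in any form, the chart itself is simple. Here's a sketch assuming a plain dict of per-week scores keyed by contributor (the data shape and numbers are invented placeholders, not a SourceCred output format):

```python
# Sketch of the "cred per contributor over time" chart, assuming per-week scores
# already exist in a plain dict. The data below is made up for illustration.
import matplotlib.pyplot as plt

weeks = ["2019-W18", "2019-W19", "2019-W20", "2019-W21"]
cred_by_contributor = {
    "alice": [1.0, 1.4, 1.6, 2.1],
    "bob":   [0.5, 0.9, 1.8, 1.9],
}

for name, scores in cred_by_contributor.items():
    plt.plot(weeks, scores, marker="o", label=name)

plt.xlabel("week")
plt.ylabel("cred")
plt.title("Cred per contributor over time (toy data)")
plt.legend()
plt.show()
```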

@ryanMorton

They did the work, whether it met the criteria or not. Or is attribution/contribution also a question of quality and/or the maintainers' expectations? I think maintainers have existing mechanisms to ensure quality and practices that currently work.

From my (albeit limited) experience working in OSS, my understanding is that maintainers generally do have rules and expectations, and that they are generally necessary. Otherwise it creates too much work for them to integrate submissions. I think the goal should be to automate enforcement of these rules (perhaps through cred scores), so that the maintainer has more time, perhaps for reworking submissions not up to par.

The only other thought I have to share: I like the heuristics because it’s more of a policy that can be known, monitored, and verified systematically. Scoring contributions one by one is more personal/prone to conflict, more difficult to monitor, and I’m not sure what if anything could be verified.

I sense that heuristics are what will drive scores more in the future. And that the open plugin architecture will enable that. My question is, what signal trains those heuristics? If we assume an honest/accurate maintainer (a fair assumption when working with early adopters), then their direct measurements, fed into SC until a realistic picture of reality emerges, would, I think, be a good training signal. Also, heuristics, especially if they're transparent, will be gamed. A human in the loop somewhere (even if just in the training of the heuristics) may be necessary to curb that.


I have working code that can pull data from SourceCred data dumps, thanks to @decentralion. I also built some utilities that include specifying the heuristics by type, and I am using the networkx Python package for the graph data structures, so it's relatively easy to just add a field for custom weight and have it supersede the type-based weight. The PageRank algorithm implementation in Python is a bit more general than the production one, so I can explore options, and the games framework involving the interaction between rules and behaviors is quite robust. My team has been building and using it for a year.
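As a rough illustration of that custom-weight rule (a sketch only; the node types, the default weights, and the way weights feed into PageRank as a personalization/seed vector are simplifications for the example, not the production algorithm or the research code):

```python
# Sketch of "custom weight supersedes type-based weight" using networkx node
# attributes. Type weights, node names, and the seed-vector treatment are toys.
import networkx as nx

TYPE_WEIGHTS = {"user": 1.0, "commit": 2.0, "comment": 0.5}  # example defaults

G = nx.DiGraph()
G.add_node("alice", kind="user")
G.add_node("commit_1", kind="commit")
G.add_node("comment_1", kind="comment", custom_weight=4.0)  # manual override
G.add_edge("alice", "commit_1")
G.add_edge("commit_1", "alice")
G.add_edge("comment_1", "commit_1")

def node_weight(G, n):
    """Custom weight wins if present; otherwise fall back to the type default."""
    data = G.nodes[n]
    return data.get("custom_weight", TYPE_WEIGHTS[data["kind"]])

# One (simplified) way to feed weights into PageRank: use them as the
# teleportation/seed distribution. Setting a node's weight to 0 (e.g. a
# spammer's user node, as discussed above) drops it from the seed entirely.
personalization = {n: node_weight(G, n) for n in G.nodes}
scores = nx.pagerank(G, alpha=0.85, personalization=personalization)
print({n: round(s, 3) for n, s in scores.items()})
```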

Downside is that while I've put a lot of effort into the experimental apparatus, the codebase is not well integrated into GitHub - sourcecred/research: Repository for research-related items on SourceCred's agenda, but we're working on getting that cleaned up. In the long run, I'd very much like to support a flight simulator for maintainers based on some of their decisions. I've done a lot of research on networked social and economic games over the years, and my firm does rigorous multi-stakeholder, multi-mechanism ecosystem design using these methods and tools. However, to @s_ben's point, this is not immediately useful until someone can just fetch the code and play with it themselves.

@s_ben In my experience you need to do this anyway. For most of these systems there is a natural tension between what feels right in an immediate sense and what produces the desired emergent properties. Since the emergent properties are properties of feedback and interaction effects, they rarely conform to expectations; most systems end up with emergent behavior that is the consequence of only what feels right in a local sense.

In order to design robust multi-stakeholder, multi-mechanism ecosystems, one must do both the behavioral economics element (which is the part you are doing when you deploy the prototype and learn from using it hands-on) and the ecosystems engineering (which is the part where one focuses on the emergent properties, taking into account many possible behavior patterns and learning about the interaction effects between the actors and the sensitivity of the outcome patterns to the design choices available).

I find it easier to reason about this like ecology and evolutionary biology rather than economics and computer science. Even applying these methods in robotics, aerospace, and defense settings, the buzzword is "bio-inspired", because these approaches are the ones that have been shown to work most effectively in unstructured or open-ended problem spaces.

When you think about it, we're empowering open source maintainers to do real-time game design on their project; this can epically accelerate a project, but it can also produce some pretty toxic environments (even with the best of intentions). My goal is to help us create more of the former and less of the latter, through understanding and guidance rather than by heavily restricting the maintainer's configuration choices.

Don't let my zeal scare you. I have been working on a variety of multi-agent coordination problems since 2003/4 and I completed a PhD on the subject, so I am not asking that we have maintainers follow along with the theory step for step, but I do have 2 major agenda items that face maintainers:

  • Continue to write short articles explaining concepts which are important for maintainers to understand if they want to make use of the finer-grained controls in SourceCred (e.g. this article Exploring Subjectivity in Algorithms | by Michael Zargham | SourceCred | Medium, which really focuses on alpha)
  • A research codebase that provides an apparatus for exploring decisions about SourceCred instance configurations, the implications of behavioral norms and contributor conventions, and actual SourceCred data more deeply, with the help of @ryanMorton and others.

I am actually working on another short article today which aims to explain the core concepts in the formalization of the network formation game that are in the diagram above, and which are covered a bit more deeply in the paper I shared earlier in the thread. It's my hope that, for the time being, those research-backed Medium articles will be valuable for you and others as you experiment with SourceCred instances.

I'd be thrilled to have continued feedback as you play with the awesome prototype @decentralion built, and what you and others find will continue to inform decisions about how both contributors and maintainers are modeled, what experiments I actually run, and what results I distill for a broader audience to benefit from.

Okay, so it's not 'short'…

I did, however, go to great pains to minimize the math and technical jargon, focus on explaining important concepts, and include a pretty deep set of references to existing literature for anyone who wants to read more.

@mzargham love the zeal. Much to digest!

Downside is that while I've put a lot of effort into the experimental apparatus, the codebase is not well integrated into GitHub - sourcecred/research: Repository for research-related items on SourceCred's agenda, but we're working on getting that cleaned up.

Link to your research repo? Could someone with basic Python skills get it up and running?

I'd be thrilled to have continued feedback as you play with the awesome prototype @decentralion built, and what you and others find will continue to inform decisions about how both contributors and maintainers are modeled, what experiments I actually run, and what results I distill for a broader audience to benefit from.

Looking forward to playing with this stuff. Was totally having similar ideas on my own. It’s great to find something like this that’s light years ahead of what I thought was even possible. Next step is putting up a hosted instance for a maintainer to play with/offer feedback on. Bandwidth is currently fairly limited, but trying to figure out a way to carve out more time for this.

The research repo I used to speed up progress is an internal one, but I am migrating code out into the sourcecred/research repo. Most likely tomorrow; I already have some time set aside for cleanup.

I'll tag the folder in this thread as soon as I get it moved.

Cool. Python is my jam (as far as my coding skills), so this could be really helpful.

@decentralion, have decided to bite the bullet re: learning f*#@)ing JS to work on this, so will be working with the prototype too.
