Idea: Training dataset, labeling data

Preface

So I think you can, to some extent, try to approach SourceCred’s goal with machine learning. You’re trying to find a function: contributions and community interactions go in, and out comes a map of cred scores for each of those contributions and interactions.
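To make that concrete, here is a minimal sketch of the kind of function we would be trying to learn. All of the type names and fields below are hypothetical, not SourceCred’s actual graph types:

```typescript
// Hypothetical shapes only -- SourceCred's real graph types differ.
type Contribution = {
  id: string;                       // e.g. a GitHub PR or Discourse post URL
  type: "pull-request" | "issue" | "post" | "reaction";
  author: string;
};

type Interaction = {
  from: string;                     // contributor id
  to: string;                       // id of the contribution being referenced / reacted to
  kind: "reference" | "reaction" | "reply";
};

// The function we would be trying to learn: contributions and interactions
// in, a cred score per contribution id out.
type CredFunction = (
  contributions: Contribution[],
  interactions: Interaction[],
) => Map<string, number>;           // id -> cred score
```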

This isn’t a proposal for an initiative to build SourceCred-ML, especially because I think that would come with its own problems. It’s just an idea I had while thinking through what doing so would take.

The need

One of the prerequisites for even attempting a machine learning approach to finding such a function is a high-quality training dataset, along with ways to evaluate how well the model performs.

We have easy access to input data: just saturate your GitHub and Discourse API rate limits for a few days and you’ll have more than enough (a rough sketch of such a pull follows the list below). But that alone doesn’t make the data suitable for training:

  • We don’t know how to evaluate outcomes.
  • We don’t know whether the data is biased (for example, 80% of it coming from code-heavy projects).
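As an illustration of how reachable the raw data is, here is a rough sketch of pulling it. This is a simplification under assumptions: no auth, no pagination, no rate-limit handling, and the repo and forum URLs are just examples.

```typescript
// Rough sketch only: no auth, no pagination, no rate-limit backoff.
// Assumes a runtime with a global fetch (e.g. Node 18+).
async function fetchRawInputs(): Promise<void> {
  // GitHub REST API: issues (and PRs) of a single repository.
  const issues: unknown[] = await fetch(
    "https://api.github.com/repos/sourcecred/sourcecred/issues?state=all",
  ).then((r) => r.json());

  // Discourse serves JSON for most pages when `.json` is appended.
  const latest = await fetch(
    "https://discourse.sourcecred.io/latest.json",
  ).then((r) => r.json());

  console.log(`fetched ${issues.length} issues and one page of forum topics`, latest);
}
```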

Current approach

Right now, the approach to refining our SourceCred function is more or less: run experiments, find out how the people involved felt about them (maybe with some bits of data, but more likely through discussion posts and chats), improve our personal understanding from that, and use intuition to move forward.

I don’t think this approach is going away, and it’s something we should keep doing. The CredSperiment is a great example.

The idea

However, what could we do to build a high-quality dataset? And would that be valuable even if we never go down the machine learning implementation route?

Some thoughts:

  • Should we ask people to manually label data from projects they know well?
  • Should we collect feedback from live SourceCred installations (like the weights used or quality rankings)? (Possible shapes for both kinds of record are sketched below.)
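For illustration, here is one possible shape for each kind of record. Every field name here is a hypothetical placeholder, not an existing SourceCred structure:

```typescript
// Hypothetical record shapes for the two collection ideas above.

// A manual label from someone who knows the project well.
type ManualLabel = {
  contributionId: string;          // e.g. URL of the PR / issue / post
  project: string;
  labeler: string;
  perceivedValue: number;          // e.g. 0-10: "how much cred should this earn?"
  notes?: string;
};

// A feedback snapshot from a live SourceCred instance.
type InstanceFeedback = {
  project: string;
  timestamp: string;               // ISO 8601
  weights: Record<string, number>; // the node/edge weights the instance ran with
  satisfaction?: number;           // e.g. a community survey score for the resulting ranking
};
```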

As for its value:

  • Could we use this data to test new ideas, for example around cred minting?
  • Could it tell us which weights we should use?
  • Could it help us find types of projects that perform well or poorly with a particular algorithm? (A rough sketch of such an evaluation follows below.)
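Here is a sketch of what such a test could look like: given manual labels and the cred scores produced by an algorithm variant, compare the two rankings. Spearman rank correlation is just one plausible metric, and the types are the hypothetical ones from above, trimmed down:

```typescript
// Trimmed version of the hypothetical ManualLabel sketched earlier.
type ManualLabel = { contributionId: string; perceivedValue: number };

// Spearman rank correlation; assumes n > 1 and no tied values
// (a sketch, not a robust statistics implementation).
function spearman(xs: number[], ys: number[]): number {
  const rank = (v: number[]): number[] => {
    const order = v.map((x, i) => [x, i] as const).sort((a, b) => a[0] - b[0]);
    const ranks = new Array<number>(v.length).fill(0);
    order.forEach(([, originalIndex], pos) => {
      ranks[originalIndex] = pos + 1;
    });
    return ranks;
  };
  const rx = rank(xs);
  const ry = rank(ys);
  const n = xs.length;
  const d2 = rx.reduce((sum, r, i) => sum + (r - ry[i]) ** 2, 0);
  return 1 - (6 * d2) / (n * (n * n - 1));
}

// How well does an algorithm variant's output agree with the human labels?
// 1 means identical ranking, -1 means fully reversed.
function evaluateAlgorithm(
  labels: ManualLabel[],
  credScores: Map<string, number>, // contributionId -> cred from the variant under test
): number {
  const labeled = labels.filter((l) => credScores.has(l.contributionId));
  const human = labeled.map((l) => l.perceivedValue);
  const algo = labeled.map((l) => credScores.get(l.contributionId)!);
  return spearman(human, algo);
}
```

With something like this in place, two weight configurations or two minting rules could be compared by which one agrees more closely with what the community itself considered valuable.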