Idea: Training dataset, labeling data

Preface

So I think you can, to some extent, try to approach SourceCred’s goal with machine learning. You’re trying to find a function: contributions and community interactions go in, and out comes a map of cred scores for each of those contributions and interactions.
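To make that concrete, here is a minimal sketch of the kind of function we would be trying to learn. All of the type names and fields below are hypothetical, not SourceCred’s actual graph types:

```typescript
// Hypothetical shapes only -- SourceCred's real graph types differ.
type Contribution = {
  id: string;                       // e.g. a GitHub PR or Discourse post URL
  type: "pull-request" | "issue" | "post" | "reaction";
  author: string;
};

type Interaction = {
  from: string;                     // contributor id
  to: string;                       // id of the contribution being referenced / reacted to
  kind: "reference" | "reaction" | "reply";
};

// The function we would be trying to learn: contributions and interactions
// in, a cred score per contribution id out.
type CredFunction = (
  contributions: Contribution[],
  interactions: Interaction[],
) => Map<string, number>;           // id -> cred score
```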

This isn’t a proposal for an initiative to build SourceCred-ML, especially because I think that would come with its own problems. It’s just an idea I had while thinking through what doing so would take.

The need

One of the prerequisites for even attempting a machine learning approach to finding such a function is a high-quality training dataset, along with ways to evaluate how well the model performs.

We have easy access to input data: just saturate your GitHub and Discourse API rate limits for a few days and you’ll have more than enough (a rough sketch of such a pull follows the list below). But that alone doesn’t make the data suitable for training:

  • We don’t know how to evaluate outcomes.
  • We don’t know whether the data is biased (for example, 80% of it coming from code-heavy projects).
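As an illustration of how reachable the raw data is, here is a rough sketch of pulling it. This is a simplification under assumptions: no auth, no pagination, no rate-limit handling, and the repo and forum URLs are just examples.

```typescript
// Rough sketch only: no auth, no pagination, no rate-limit backoff.
// Assumes a runtime with a global fetch (e.g. Node 18+).
async function fetchRawInputs(): Promise<void> {
  // GitHub REST API: issues (and PRs) of a single repository.
  const issues: unknown[] = await fetch(
    "https://api.github.com/repos/sourcecred/sourcecred/issues?state=all",
  ).then((r) => r.json());

  // Discourse serves JSON for most pages when `.json` is appended.
  const latest = await fetch(
    "https://discourse.sourcecred.io/latest.json",
  ).then((r) => r.json());

  console.log(`fetched ${issues.length} issues and one page of forum topics`, latest);
}
```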

Current approach

Right now, the approach to refining our SourceCred function is more or less: run experiments, find out how the people involved felt about them (maybe with some bits of data, but more likely through discussion posts and chats), improve our personal understanding from that, and use intuition to move forward.

I don’t think this approach is going away, and it’s something we should keep doing. The CredSperiment is a great example.

The idea

However, what could we do to build a high-quality dataset? And would that be valuable even if we never go down the machine learning implementation route?

Some thoughts:

  • Should we ask people to manually label data from projects they know well?
  • Should we collect feedback from live SourceCred installations (like the weights used or quality rankings)? (Possible shapes for both kinds of record are sketched below.)
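For illustration, here is one possible shape for each kind of record. Every field name here is a hypothetical placeholder, not an existing SourceCred structure:

```typescript
// Hypothetical record shapes for the two collection ideas above.

// A manual label from someone who knows the project well.
type ManualLabel = {
  contributionId: string;          // e.g. URL of the PR / issue / post
  project: string;
  labeler: string;
  perceivedValue: number;          // e.g. 0-10: "how much cred should this earn?"
  notes?: string;
};

// A feedback snapshot from a live SourceCred instance.
type InstanceFeedback = {
  project: string;
  timestamp: string;               // ISO 8601
  weights: Record<string, number>; // the node/edge weights the instance ran with
  satisfaction?: number;           // e.g. a community survey score for the resulting ranking
};
```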

As for its value:

  • Could we use this data to test new ideas, for example around cred minting?
  • Could it tell us which weights we should use?
  • Could it help us find types of projects that perform well or poorly with a particular algorithm? (A rough sketch of such an evaluation follows below.)
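Here is a sketch of what such a test could look like: given manual labels and the cred scores produced by an algorithm variant, compare the two rankings. Spearman rank correlation is just one plausible metric, and the types are the hypothetical ones from above, trimmed down:

```typescript
// Trimmed version of the hypothetical ManualLabel sketched earlier.
type ManualLabel = { contributionId: string; perceivedValue: number };

// Spearman rank correlation; assumes n > 1 and no tied values
// (a sketch, not a robust statistics implementation).
function spearman(xs: number[], ys: number[]): number {
  const rank = (v: number[]): number[] => {
    const order = v.map((x, i) => [x, i] as const).sort((a, b) => a[0] - b[0]);
    const ranks = new Array<number>(v.length).fill(0);
    order.forEach(([, originalIndex], pos) => {
      ranks[originalIndex] = pos + 1;
    });
    return ranks;
  };
  const rx = rank(xs);
  const ry = rank(ys);
  const n = xs.length;
  const d2 = rx.reduce((sum, r, i) => sum + (r - ry[i]) ** 2, 0);
  return 1 - (6 * d2) / (n * (n * n - 1));
}

// How well does an algorithm variant's output agree with the human labels?
// 1 means identical ranking, -1 means fully reversed.
function evaluateAlgorithm(
  labels: ManualLabel[],
  credScores: Map<string, number>, // contributionId -> cred from the variant under test
): number {
  const labeled = labels.filter((l) => credScores.has(l.contributionId));
  const human = labeled.map((l) => l.perceivedValue);
  const algo = labeled.map((l) => credScores.get(l.contributionId)!);
  return spearman(human, algo);
}
```

With something like this in place, two weight configurations or two minting rules could be compared by which one agrees more closely with what the community itself considered valuable.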