Analyzing SourceCred data

I’m dusting off my Python here to do some analysis on cred scores. I thought I would ping the community for input before I start coding. I want to make sure the data structures and analysis functions I write are aligned with the needs of the broader community, and that I don’t paint myself into any corners. I’m OK with this stuff, but I’m no data scientist. These are the questions coming to mind:

What types of questions will people ask of SourceCred data?

There are obviously a bazillion ways to slice this, but are there any burning questions that stand out? Any pressing needs I should be aware of? The main use cases I’m imagining are detecting gaming, estimating cred gained for a given contribution, creating new views into the data (e.g. leaderboards showing contributions vs. contributors), and looking for interesting relationships (e.g. causal relationships over time, trust levels between certain contributors, etc.).

What data do we want to analyze?

I’m starting with the raw cred scores in scores.json, but does anyone want to dive into more granular graph data? Grain data?

What format do we want the data in?

I’m imagining it will be useful to package data into easily analyzable formats. For example, I’m thinking it would be interesting to analyze Maker forum activity during Black Thursday and create a post about it on their forums. That community is likely to want to dig in and play with the data, but tossing them a big gnarly JSON file with no documentation will exclude a lot of people. Do we want to create human-readable CSV files? Something else?
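
To make the CSV idea concrete, here is a minimal sketch of flattening nested score data into one row per user per interval. The input layout below (`intervals`, `users`, `name`, `cred`) is made up for illustration and is an assumption, not the actual scores.json schema; the helper names `scores_to_rows` and `rows_to_csv` are hypothetical too.

```python
import csv
import io
import json

# ASSUMPTION: this layout is invented for the example.
# It is NOT the real scores.json schema.
SAMPLE_JSON = """
{
  "intervals": ["2020-03-08", "2020-03-15"],
  "users": [
    {"name": "alice", "cred": [1.5, 2.0]},
    {"name": "bob", "cred": [0.5, 4.0]}
  ]
}
"""

FIELDS = ["user", "interval_start", "cred"]


def scores_to_rows(data):
    """Flatten the nested structure into one dict per (user, interval)."""
    rows = []
    for user in data["users"]:
        for interval, cred in zip(data["intervals"], user["cred"]):
            rows.append(
                {"user": user["name"], "interval_start": interval, "cred": cred}
            )
    return rows


def rows_to_csv(rows):
    """Serialize the flat rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = scores_to_rows(json.loads(SAMPLE_JSON))
csv_text = rows_to_csv(rows)
print(csv_text)
```

A "long" table like this (one observation per row) tends to open cleanly in a spreadsheet and is easy to pivot, which is why I would lean toward it over one column per interval.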

Do we want to document current data models?

Do we want to create low-level technical documentation around the data models and formatting? I would find this useful right now myself, but I’m not sure if it’s overkill. Also, are the data objects (e.g. scores.json) stable, or are they still shifting as we add new functionality? I’m imagining creating at least high-level documentation around what data lives in what files, but wanted to get input here before creating any issues on sourcecred/docs.

Some questions I’d be interested in:

  • What’s the cred minted by source? (e.g. is an overweight proportion of our cred getting minted on GH issues?)
  • and how has that changed over time?
  • What is the average cred per node of various types (e.g. comments)? What’s the variance?
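
As a rough sketch of what these questions look like once per-node data is available: the record layout below (`source`, `type`, `cred`) is an assumption for illustration, not a format SourceCred currently exports, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance

# ASSUMPTION: hypothetical per-node records; not an existing export format.
NODES = [
    {"source": "github", "type": "issue", "cred": 4.0},
    {"source": "github", "type": "comment", "cred": 1.0},
    {"source": "github", "type": "comment", "cred": 3.0},
    {"source": "discourse", "type": "post", "cred": 2.0},
]


def cred_share_by_source(nodes):
    """Return each source's fraction of the total cred."""
    totals = defaultdict(float)
    for node in nodes:
        totals[node["source"]] += node["cred"]
    grand_total = sum(totals.values())
    return {source: total / grand_total for source, total in totals.items()}


def stats_by_type(nodes):
    """Return mean and population variance of cred for each node type."""
    by_type = defaultdict(list)
    for node in nodes:
        by_type[node["type"]].append(node["cred"])
    return {
        node_type: {"mean": mean(values), "variance": pvariance(values)}
        for node_type, values in by_type.items()
    }


shares = cred_share_by_source(NODES)
stats = stats_by_type(NODES)
print(shares)
print(stats)
```

Tracking change over time would mean adding an interval field to each record and grouping by it as well.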

Right now, the data we export is extremely difficult to use for this purpose, since the scores file only shows scores per-user.

Overall, we should come up with a canonical data format that makes these kinds of analyses easy. Right now, it is possible to get this information from the data that’s written to disk, but it’s all undocumented, and requires either depending on or re-implementing the TimelineCred class.

Realistically, in-depth analysis (i.e. going beyond raw user scores) is going to be frustrating until after we fix this. We could potentially include this as part of the “Quality of Life” section on our Beta 1 Roadmap.

I’ve proposed a data format we could use in this GitHub issue.

Some of these questions, I think, will be use cases for the new explorer/UI: for example, having something like “profile pages” and moving away from the current leaderboards.

Urgent needs I can come up with:

  • Better comparisons of changes. Before/after comparisons like the one in “Add first batch of CredCon and Partner initiatives #27” are pretty bare bones. Script here for reference.
  • Weight sanity checking: absolute Cred minted per contribution type.
  • Per user, which contribution types their Cred came from (though this seems like a UI task).
  • Financial policy input:
    • Given an arbitrary “living wages” number in Grain, how have contributors been able to achieve this over time?
    • Relative inflation rate between Cred and Grain?
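
For the financial policy questions, a minimal sketch under some loud assumptions: all numbers below are invented, "inflation" is taken to mean period-over-period growth of total supply (one possible definition, not an agreed one), and the names `growth_rates` and `count_at_or_above` are hypothetical.

```python
# ASSUMPTION: all figures are invented for illustration.
CRED_SUPPLY = [100.0, 110.0, 132.0]      # total Cred at the end of each interval
GRAIN_SUPPLY = [1000.0, 1200.0, 1320.0]  # total Grain at the end of each interval

# Grain paid to each contributor per interval (hypothetical).
GRAIN_PAID = {
    "alice": [50.0, 120.0],
    "bob": [10.0, 20.0],
}
LIVING_WAGE = 100.0  # arbitrary threshold, in Grain per interval


def growth_rates(totals):
    """Fractional growth between consecutive entries of a supply series."""
    return [(cur - prev) / prev for prev, cur in zip(totals, totals[1:])]


def count_at_or_above(paid, threshold):
    """Per interval, count contributors whose payout meets the threshold."""
    per_interval = zip(*paid.values())
    return [sum(1 for amount in amounts if amount >= threshold)
            for amounts in per_interval]


cred_growth = growth_rates(CRED_SUPPLY)
grain_growth = growth_rates(GRAIN_SUPPLY)
# Positive values mean Grain supply grew faster than Cred in that interval.
relative_inflation = [g - c for g, c in zip(grain_growth, cred_growth)]
living_wage_counts = count_at_or_above(GRAIN_PAID, LIVING_WAGE)

print(relative_inflation)
print(living_wage_counts)
```

Whether a difference of growth rates or a ratio is the more useful "relative inflation" measure is an open question; the difference is just the simplest thing to start with.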