Experiment: SourceCred stack lookup

So I’ve started an experiment over at https://github.com/teamopen-dev/sourcecred-stack-lookup (with a lot of feedback from @nothingismagick :smiley:)

The basic idea is to calculate cred scores for GitHub repositories ahead of time and host the results, so you can do a fast lookup of information about a repository without having to mirror it or calculate cred scores yourself.

This approach saves you anywhere from a few minutes to a few hours per repository, allowing you to do aggregate analysis, at the cost of being able to tweak parameters like the weights used.

Use-case: low bus-factor risk for javascript projects

The current version uses a reasonably simple interpretation of SourceCred scores to find out which of your dependencies might carry a bus-factor risk.

It looks at a few things:

  • Is most of the work done by a few people?
  • Do the same people show up in different projects as top contributors?
  • Did a lot of work go into the project?

It categorizes these factors into Low, Medium, High and CRITICAL impact.
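The actual heuristic lives in the repository linked above; purely to illustrate the idea, here is a hypothetical sketch of how cred scores could be turned into such a classification. The thresholds and function names are my own assumptions, not the tool’s real ones.

```javascript
// Hypothetical bus-factor heuristic over per-contributor cred scores.
// Thresholds here are illustrative, not the ones the lookup tool uses.

// Fraction of total cred held by the top `n` contributors.
function topShare(credScores, n) {
  const sorted = [...credScores].sort((a, b) => b - a);
  const total = sorted.reduce((sum, c) => sum + c, 0);
  const top = sorted.slice(0, n).reduce((sum, c) => sum + c, 0);
  return total === 0 ? 0 : top / total;
}

// Classify risk: concentrated cred in a high-effort project is worst.
function busFactorRisk(credScores) {
  const concentration = topShare(credScores, 2);
  const totalWork = credScores.reduce((sum, c) => sum + c, 0);
  if (concentration > 0.8 && totalWork > 1000) return "CRITICAL";
  if (concentration > 0.8) return "High";
  if (concentration > 0.5) return "Medium";
  return "Low";
}
```

The three bullet points map onto this directly: concentration covers “most of the work done by a few people”, and total cred approximates “a lot of work went into the project”; cross-project overlap would be a second pass over the per-project results.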

Example: Scanning sourcecred/sourcecred

sourcecred/sourcecred$ yarn -s lookup
Collecting data for sourcecred/sourcecred/package.json
CRITICAL impact contributors at risk from bus-factor found:
- @JoshuaWise, in projects: [ 'joshuawise/better-sqlite3' ]
- @kkaefer, in projects: [ 'joshuawise/better-sqlite3' ]
- @jgm, in projects: [ 'jgm/commonmark.js' ]
- @mbostock, in projects: [
- @raszi, in projects: [ 'raszi/node-tmp' ]
- @silkentrance, in projects: [ 'raszi/node-tmp' ]
- @jessebeach, in projects: [ 'evcohen/eslint-plugin-jsx-a11y' ]
- @ljharb, in projects: [ 'evcohen/eslint-plugin-jsx-a11y', 'chrisdickinson/raf' ]
- @evcohen, in projects: [ 'evcohen/eslint-plugin-jsx-a11y' ]
- @coveralls, in projects: [ 'evcohen/eslint-plugin-jsx-a11y' ]

HIGH impact contributors at risk from bus-factor found:
- @springmeyer, in projects: [ 'joshuawise/better-sqlite3' ]
- @Mithgol, in projects: [ 'joshuawise/better-sqlite3' ]
- @SGrondin, in projects: [ 'sgrondin/bottleneck' ]
- @tmpfs, in projects: [ 'jgm/commonmark.js' ]

We encourage you to make sure these contributors receive enough support.

Current NPM implementation

An NPM package is usually open source and likely to have a GitHub link in its package.json, meaning we should be able to crawl it pretty easily. So using this for an aggregate use-case made sense.
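The crawling step boils down to extracting an “owner/repo” slug from a package.json `repository` field. This is a sketch of how that could look; real-world `repository` values are messier than the handful of forms handled here, and `githubSlug` is my own illustrative name.

```javascript
// Resolve a GitHub "owner/repo" slug from a package.json "repository"
// field, which may be a string or a {type, url} object.
function githubSlug(repository) {
  const url =
    typeof repository === "string"
      ? repository
      : (repository && repository.url) || "";
  // Full URLs: https://github.com/owner/repo(.git), git@github.com:owner/repo
  const match = url.match(
    /github\.com[/:]([^/]+)\/([^/#]+?)(?:\.git)?(?:[/#].*)?$/
  );
  if (match) return `${match[1]}/${match[2]}`;
  // npm shorthand: "owner/repo" or "github:owner/repo"
  const short = url.match(/^(?:github:)?([\w.-]+)\/([\w.-]+)$/);
  return short ? `${short[1]}/${short[2]}` : null;
}
```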

Generating scores

There is a cronjob running on my server, which gradually works through a queue of projects and generates score files for them. The SourceCred data folder is kept on the server as a cache, but the score files are committed to GitHub Pages: https://github.com/teamopen-dev/sourcecred-stack-lookup/tree/gh-pages Also included is a meta file, which lists the available scores and their last-updated timestamps.


The client

There’s also a client, uploaded as a package on NPM: https://www.npmjs.com/package/@teamopen/sourcecred-stack-lookup

It has logic to resolve NPM package names to GitHub repos using the meta.json file, find out which scores are currently available, and then download those.

(This client works as a devDependency for Node, as well as in notebooks, like this one: https://observablehq.com/@beanow/do-things-with-scsl)


Some lessons learned

The value of this hosted data approach

Loading the current list of 159 GitHub repositories took about 2 days from an empty cache.

Using the client to download and interpret the scores is a matter of seconds. That saves you an incredible amount of time, and opens up new applications that would otherwise be infeasible.

This idea can be leveraged further by pre-calculating the minimal amount of data your application needs based on the scores, instead of having the client download the full scores file and analyze it locally. (Though I think my questionable bus-factor calculation doesn’t deserve this optimization any time soon.)

I’m planning to keep this data available for anyone who would like to experiment with aggregated data. If you need help getting set up, or would like to include more repositories in the data set, get in touch!

Data compression

Right now, there are 159 GitHub repos scored, meaning 159 cache databases, graphs and scores. That’s a lot of data: something to the tune of 6 GB, and the scores alone grew to something on the order of 600 MB.

Pushing a 600 MB commit to host them on gh-pages, and gzipping / gunzipping the several-GB SourceCred cache, soon created a bottleneck. :sweat_smile:

What I’ve done now is gzip the scores before pushing to gh-pages, and have the client gunzip them for you. That was a 7x speed increase with a similar compression ratio.

Also, I’m now giving every GitHub repo its own “instance” instead of using one very large SourceCred data folder. I gunzip just one instance before loading and scoring it, then gzip that instance when done. This saves about 4 minutes of needlessly (un)compressing data to temporary storage on every hourly cronjob run. Because I’m targeting about 40-minute runs per hour, that’s a 10% speed increase.

The average compression with gzip that I’m getting for the SourceCred data folder(s) is an ~80% reduction! That is absolutely massive! I believe this is partially because some data is stored in JSON format, and the SQLite .db files are uncompressed as well (maybe even sparse?).

I would highly encourage anyone who persists or transfers SourceCred data folders to compress them with whichever algorithm you like.

Minimum time required for sourcecred load

Initially my idea was to process the queue of repos in a wide-first approach, so that smaller repositories get done quickly and larger repositories come later.

I did this by setting a target run length (e.g. 40 minutes) and taking as many repos from the queue as fit within a minimum of 1 minute per repo (so max 40 repos). Whenever a load didn’t complete within that 1 minute, I would kill it and rely on the cache to eventually load everything.

This completely failed at loading larger repositories. I could see two reasons for this.

  1. At the end of a load, we compute cred once. For very large projects this can take as much as 8 minutes. So the mirroring of data would complete, but with a 1-minute budget, the cred computation would never complete.

  2. There is a “startup time” where we load and validate what’s already in the cache. For very large projects, this may take up a majority of the 1-minute budget, leaving even less time for the cred computation from problem 1 and making the overall process very inefficient.

So now I’ve changed the time-budget approach. I’m still targeting 40 minutes, but each repo now has a maximum of 10 minutes to load before it’s killed, and I keep taking a new repo from the queue until we’ve exceeded 30 minutes. Most of the time this results in 30 + 10 minutes, and smaller repos still complete fast without impacting the time per run much.

Overall this has been quite successful. It still allows smaller repos to be completed sooner, getting them over with and out of the queue, while also allowing larger repos to be processed.
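The scheduling described above can be sketched as a small loop. This is an illustration, not the cronjob’s actual code: `loadRepo` is a hypothetical stand-in for spawning `sourcecred load`, and the budgets are parameters so the 30 + 10 minute defaults can be overridden.

```javascript
// Time-budget scheduling: keep taking repos until the take budget is
// exceeded, and kill any single load that runs past its per-repo budget.
async function processQueue(
  queue,
  loadRepo,
  {takeBudgetMs = 30 * 60 * 1000, perRepoMs = 10 * 60 * 1000} = {}
) {
  const start = Date.now();
  const results = [];
  while (queue.length > 0 && Date.now() - start < takeBudgetMs) {
    const repo = queue.shift();
    // Race the load against its kill switch.
    let timer;
    const killSwitch = new Promise((resolve) => {
      timer = setTimeout(() => resolve("killed"), perRepoMs);
    });
    const outcome = await Promise.race([loadRepo(repo), killSwitch]);
    clearTimeout(timer);
    results.push({repo, outcome});
  }
  return results;
}
```

Note the loop only checks the take budget between repos, which is exactly why a run can stretch to 30 + 10 minutes when the last repo taken is a large one.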

Using the scores as a second cache layer

Pretty soon, I will need to update my sourcecred version to include https://github.com/sourcecred/sourcecred/pull/1407

This means the cache will need to be completely deleted, as the schema will be updated. This would also be a problem for any major update to sourcecred.

The idea of losing 6 GB / 2 days worth of cached data kinda sucks :sweat_smile: . However, not all is lost (literally).

The scores file has its own versioning, and this is what we’re hosting. For our application of aggregating lots of data, it’s OK if it goes stale by a few days, or even 1-2 weeks. The meta.json file also lets you determine exactly how old the data is, in case your application can’t use data that’s too old.

What I’m planning to do here is to have them back each other up. If I need to throw away the score files, I need to make sure I can use the SourceCred cache to generate new scores quickly. If I need to throw away the cache, I need to make sure the already-generated scores buy me time to mirror data, by letting the scores file go a little more stale.
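A consumer can implement that staleness tolerance with a simple check against the meta file. The entry shape below (a `lastUpdated` epoch-millisecond timestamp) is my assumption about meta.json, not a documented format:

```javascript
// Accept hosted scores up to `maxAgeDays` old, based on a meta.json entry;
// anything older (or missing) means falling back to a fresh load.
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function usableScores(metaEntry, maxAgeDays, now = Date.now()) {
  if (!metaEntry || metaEntry.lastUpdated == null) return false;
  return now - metaEntry.lastUpdated <= maxAgeDays * MS_PER_DAY;
}
```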

Where’s the rate limit?

Previously I hit the rate limit by leaving a load command running.

However, now that I’m running a cronjob to load data 24/7, and at the same time have a load running on my workstation to test https://github.com/sourcecred/sourcecred/pull/1407, I’m actually amazed I’ve not been rate limited at all since.

I wonder if something changed. The 10-minute kill switch maybe. Or maybe the other projects have a different data density? :thinking: I’m not sure what’s happening here. But getting rate limited may be rarer than I thought.


Mono vs multi-repo

Another thing I noticed: currently this approach doesn’t do a great job of detecting the support behind a particular package when it’s developed in a monorepo.

Packages like this one from the example:

- @mbostock, in projects: [

were detected because they are developed multi-repo.

On the other hand, all these npm packages go to the same babel repo.

    "@babel/core": "babel/babel",
    "@babel/plugin-proposal-class-properties": "babel/babel",
    "@babel/preset-env": "babel/babel",
    "@babel/preset-flow": "babel/babel",
    "@babel/preset-react": "babel/babel",
    "@babel/plugin-proposal-decorators": "babel/babel",
    "@babel/plugin-proposal-export-namespace-from": "babel/babel",
    "@babel/plugin-proposal-function-sent": "babel/babel",
    "@babel/plugin-proposal-json-strings": "babel/babel",
    "@babel/plugin-proposal-numeric-separator": "babel/babel",
    "@babel/plugin-proposal-throw-expressions": "babel/babel",
    "@babel/plugin-syntax-dynamic-import": "babel/babel",
    "@babel/plugin-syntax-import-meta": "babel/babel",
    "@babel/plugin-transform-runtime": "babel/babel",
    "@babel/runtime": "babel/babel",
    "@babel/runtime-corejs2": "babel/babel",
    "@babel/runtime-corejs3": "babel/babel",

Those are just the ones that came up from my sample set, but there’s many more: https://github.com/babel/babel/tree/master/packages#readme

The package.json format does have an optional way to indicate which folder of a repo a package lives in, which is useful for monorepos. But it would be quite complicated for SourceCred to support splitting cred by folder.
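For reference, I believe the field meant here is `repository.directory`, which npm documents for exactly this monorepo case. A babel-style package might declare it like this (values illustrative):

```json
{
  "name": "@babel/core",
  "repository": {
    "type": "git",
    "url": "https://github.com/babel/babel.git",
    "directory": "packages/babel-core"
  }
}
```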


At some point, I do want us to develop “scoped cred” functionality like this, whether at the folder level (who has cred in the GitHub plugin?), the file level (who has cred in graph.js?), or the repo level (when loading the whole org, who has cred in this sub-repository?).

However, I agree that it will take a lot of engineering work to get these capabilities.