Experiment: SourceCred stack lookup

Some lessons learned

The value of this hosted data approach

Loading the current list of 159 GitHub repositories took about 2 days from an empty cache.

Using the client to download and interpret the scores takes a matter of seconds. That saves you incredible amounts of time, and opens up new applications that would otherwise be infeasible.

This idea can be taken further by pre-calculating the minimal amount of data your application needs based on the scores, instead of having the client download the full scores file and analyze it locally. (Though I think my questionable bus-factor calculation doesn’t deserve this optimization any time soon.)
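As a rough sketch of what that pre-calculation could look like, here is a small script that boils a full scores file down to a tiny summary that clients could fetch instead. The file names and the shape of the score entries are assumptions for illustration, not SourceCred’s actual scores format.

```ts
import * as fs from "fs";

// Assumed shape: one entry per contributor with a total cred number.
// This is a placeholder, not the real scores schema.
interface ScoreEntry {
  id: string;
  totalCred: number;
}

function summarize(scoresPath: string, summaryPath: string): void {
  const entries: ScoreEntry[] = JSON.parse(fs.readFileSync(scoresPath, "utf8"));
  const sorted = [...entries].sort((a, b) => b.totalCred - a.totalCred);
  // Keep only what the application needs: the overall total and a top-10 list.
  const summary = {
    totalCred: sorted.reduce((sum, e) => sum + e.totalCred, 0),
    topContributors: sorted.slice(0, 10).map((e) => ({ id: e.id, cred: e.totalCred })),
  };
  fs.writeFileSync(summaryPath, JSON.stringify(summary));
}

summarize("scores/sourcecred_sourcecred.json", "summary/sourcecred_sourcecred.json");
```

A few-KB summary like this is all a bus-factor-style calculation would need, instead of shipping the full scores data to every client.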

I’m planning to keep this data available for anyone who would like to experiment with aggregated data. If you need help getting set up, or would like to include more repositories in the data set, get in touch!

Data compression

Right now there are 159 GitHub repos scored, meaning 159 cache databases, graphs and scores files. That’s a lot of data: something to the tune of 6 GB, and the scores alone grew to something on the order of 600 MB.

Pushing a 600 MB commit to host them on gh-pages, and gzipping / gunzipping the several-GB SourceCred cache, soon became a bottleneck. :sweat_smile:

What I’ve done now is gzip the scores before pushing to gh-pages, and have the client gunzip them for you. That gave a 7x speed increase with a similar compression ratio.
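For reference, the compression itself is just a pair of zlib calls. A minimal sketch (file names are placeholders):

```ts
import * as fs from "fs";
import * as zlib from "zlib";

// Before pushing to gh-pages: gzip the scores file (placeholder file names).
const raw = fs.readFileSync("scores/sourcecred_sourcecred.json");
fs.writeFileSync("scores/sourcecred_sourcecred.json.gz", zlib.gzipSync(raw, { level: 9 }));

// On the client, after downloading the .gz file: gunzip it back into JSON.
const downloaded = fs.readFileSync("downloads/sourcecred_sourcecred.json.gz");
const scores = JSON.parse(zlib.gunzipSync(downloaded).toString("utf8"));
```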

Also, I’m now giving every GitHub repo its own “instance” instead of one very large SourceCred data folder. I gunzip just one instance before loading and scoring it, then gzip that instance when done. This saves about 4 minutes of needlessly (un)compressing data to temporary storage every time the hourly cronjob triggers. Since I’m targeting roughly 40-minute runs per hour, that’s a 10% speed increase.
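The per-instance cycle looks roughly like the sketch below. The paths, archive layout and the exact sourcecred invocation are placeholders, and how the CLI is pointed at the instance directory depends on your setup.

```ts
import { execSync } from "child_process";

// Sketch of the hourly per-instance cycle: unpack one instance, load it,
// pack it back up. Paths and the sourcecred invocation are placeholders.
function processInstance(repo: string): void {
  const name = repo.replace("/", "_");
  const archive = `archives/${name}.tar.gz`;
  const workDir = `work/${name}`;

  // Unpack only this repo's instance, not the whole multi-GB data folder.
  execSync(`mkdir -p ${workDir} && tar -xzf ${archive} -C ${workDir}`);

  // Load and score this single repo (the exact flags for pointing
  // sourcecred at workDir are omitted; they depend on the CLI version).
  execSync(`sourcecred load ${repo}`, { cwd: workDir });

  // Pack the instance back up when done.
  execSync(`tar -czf ${archive} -C ${workDir} .`);
}
```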


The average compression I’m getting with gzip for the SourceCred data folder(s) is an ~80% reduction! That is absolutely massive! I believe this is partially because some data is stored as JSON, and the SQLite .db files are uncompressed as well (maybe even sparse?).

I would highly encourage anyone who persists or transfers SourceCred data folders to compress them with whichever algorithm you like.

Minimum time required for sourcecred load

Initially my idea was to process the queue of repos to load in a wide-first approach, so that smaller repositories get done quickly and larger repositories come later.

I did this by setting a target run length (e.g. 40 minutes) and taking repos from the queue until hitting a minimum of 1 minute per repo (so at most 40 repos). Whenever a load didn’t complete within that 1 minute, I would kill it and rely on the cache to eventually load everything.

This completely failed at loading larger repositories. I could see two reasons for this.

  1. At the end of a load, we run compute-cred once. For very large projects this can take as much as 8 minutes. So the mirroring of data would complete, but with a 1-minute budget, compute-cred would never finish.

  2. There is a “startup time” where we’re loading and validating what’s already in the cache. For very large projects this can take up a majority of the 1-minute budget, leaving even less time for compute-cred (problem 1) and making the overall process very inefficient.

So I’ve now changed the time-budget approach. I’m still targeting 40 minutes, but each repo now has a maximum of 10 minutes to load before it’s killed, and I keep taking a new repo from the stack until we’ve exceeded 30 minutes. Most of the time this results in a 30 + 10 minute run, and when loading smaller repos it lets them complete quickly without affecting the length of each run much.

Overall this has been quite successful. It still lets smaller repos complete sooner, getting them over with and out of the queue, while also allowing larger repos to be processed.
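In code terms, the loop now looks roughly like this. It is only a sketch: the stack handling and the load invocation are simplified placeholders.

```ts
import { execSync } from "child_process";

const RUN_BUDGET_MS = 30 * 60 * 1000;   // keep starting new repos for 30 minutes
const REPO_TIMEOUT_MS = 10 * 60 * 1000; // any single load is killed after 10 minutes

// Sketch of the new time-budget loop: worst case is roughly 30 + 10 minutes.
function runOnce(stack: string[]): void {
  const start = Date.now();
  while (stack.length > 0 && Date.now() - start < RUN_BUDGET_MS) {
    const repo = stack.shift()!;
    try {
      // Placeholder invocation; execSync kills the child after the timeout.
      execSync(`sourcecred load ${repo}`, { timeout: REPO_TIMEOUT_MS });
    } catch (err) {
      // Timed out or failed: the cache keeps the partial progress,
      // so put the repo back on the stack for a later run.
      stack.push(repo);
    }
  }
}
```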

Using the scores as a second cache layer

Pretty soon, I will need to update my sourcecred version to include https://github.com/sourcecred/sourcecred/pull/1407

Which means the cache will need to be completely deleted, as the schema will be updated. This would also be a problem for any major update to sourcecred.

The idea of losing 6 GB / 2 days’ worth of cached data kinda sucks :sweat_smile:. However, not all is lost (literally).

The scores file has its own versioning, and this is what we’re hosting. For our application of aggregating lots of data, it’s OK if it goes stale by a few days, or even 1-2 weeks. The meta.json file also lets you determine exactly how old the data is, if your application can’t use data that’s too old.
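For example, a client could refuse data older than two weeks with a small check like this. The meta.json field name and the URL are assumptions for illustration, and this assumes a runtime with a global fetch.

```ts
// Staleness check against the hosted meta.json. The "updatedAt" field name
// and the URL are assumptions for illustration; adjust to the real file.
async function isFreshEnough(metaUrl: string, maxAgeDays: number): Promise<boolean> {
  const res = await fetch(metaUrl);
  const meta = (await res.json()) as { updatedAt: string };
  const ageMs = Date.now() - new Date(meta.updatedAt).getTime();
  return ageMs <= maxAgeDays * 24 * 60 * 60 * 1000;
}

// Example: only use the hosted scores if they are at most 14 days old.
isFreshEnough("https://example.github.io/meta.json", 14).then((fresh) => {
  if (!fresh) console.warn("Hosted scores are stale; fall back or re-mirror.");
});
```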

What I’m planning to do here is to have them back each other up. If I need to throw away the scores files, I need to make sure I can use the sourcecred cache to generate new scores quickly. If I need to throw away the cache, I need to make sure the already-generated scores buy me time to re-mirror the data, by letting the scores file go a little more stale.

Where’s the rate limit?

Previously I hit the rate limit by leaving a load command running.

However, now that I’m running a cronjob to load data 24/7, and at the same time have a load running on my workstation to test https://github.com/sourcecred/sourcecred/pull/1407, I’m actually amazed I’ve not been rate limited at all since.

I wonder if something changed. The 10-minute kill switch, maybe? Or maybe the other projects have a different data density? :thinking: I’m not sure what’s happening here, but getting rate limited may be rarer than I thought.
