Research Design: Exploratory Data Analysis

Working in conjunction with David Sisson and @mzargham, I’ve put together some early exploratory data work on the SourceCred graph data.

For the data itself, I did a short write-up here. One question David had involved the usernames (or agents) appearing in the edges but not the nodes. @mzargham suggested that in the current model, agents are just users, but we may explore the question further. From a charting perspective, agent nodes would likely “pull” a great many other nodes into their orbit, but maybe that’s what we want.

Speaking of which, I have a mock-up of the graph data using a d3.js script I’m building here. I added some opacity CSS to make the viewing a bit easier. It appears to me that the data (without any scoring, weighting, or filtering down to smaller subgraphs) already shows some patterns. There are three major “bundles”: two are surrounded by comment nodes in orange, and one has a mix of node types, with perhaps more review nodes in brown and pull nodes in red. I think I’ll want to add some zoom functionality, as I’d like to see what those three central nodes are! Anyhow, the network chart will be under ongoing development, so feel free to weigh in on the topic.


I’m a little confused by this. The usernames are included in the nodes. Consider the node with address: ["sourcecred", "github", "USERLIKE", "USER", "decentralion"]

I’d like to explain a bit about how the node and edge addresses are generated. The first two components identify the plugin that generated the data, as [pluginOwner, pluginName]. Right now the two plugins that exist are the ["sourcecred", "github"] plugin and the ["sourcecred", "git"] plugin. We named them with the “sourcecred” prefix so that another organization could make their own “github” plugin without producing a naming collision.

After the first two components comes a series of type names, in ALL_CAPS. There can be more than one type designator if one is effectively a subtype of another. So there is the type ["USERLIKE", "USER"], or ["USERLIKE", "BOT"], or just ["REPO"].

You can see all the type prefixes for GitHub or for git.

After the type information comes whatever is needed to uniquely identify the node or edge, which depends on its type. For a pull request, it will be [repoOwner, repoName, pullNumber]. For a user, it will be [accountName]. For edges, it’s often a concatenation of the information needed to identify the source (src) and the information needed to identify the destination (dst).
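Putting those pieces together, here are a couple of illustrative node addresses as plain JavaScript arrays. (The pull-request type name "PULL" and the pull number 42 below are my assumptions for illustration, not values from the data.)

```javascript
// A user node, as in the example above:
const userNode = ["sourcecred", "github", "USERLIKE", "USER", "decentralion"];

// A pull-request node: plugin prefix, then type, then [repoOwner, repoName, pullNumber].
// The type name "PULL" and the pull number 42 are assumptions for illustration.
const pullNode = ["sourcecred", "github", "PULL", "sourcecred", "sourcecred", "42"];
```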

This is cool! Thanks for sharing.

The node in the center is the sourcecred/sourcecred repository itself. It is connected by a HAS_PARENT edge to every issue and pull request in the repository (and thus transitively to every review and comment).

The nodes on the left and right are the project’s two most prolific contributors, namely @wchargin and myself. Using William as an example, he has over 1300 comments in the repository, hence the large number of comment nodes connected to him.

In the write-up, you mention that having scores would be helpful. I can get those to you pretty easily now that we have a CLI command that generates the scored PagerankGraph for any repository. To start, here is a gist containing an up-to-date PagerankGraph for sourcecred/sourcecred. The data format is basically an old-style graph, plus every edge’s weight and every node’s score. Here is the declaration of the data format, and here is the logic that (de)serializes it.
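If it helps, loading and ranking that file is only a few lines of Node. The field names below (data.nodes, node.score, node.address) and the filename are my assumptions; check the linked data-format declaration for the real shape.

```javascript
// Minimal sketch of reading the scored graph. Field names and the
// filename are assumptions; see the data-format declaration.
const fs = require("fs");

const data = JSON.parse(
  fs.readFileSync("sourcecred_sourcecred_pagerank.json", "utf8")
);

// Rank nodes by score, highest first, and print the top ten.
const ranked = data.nodes.slice().sort((a, b) => b.score - a.score);
for (const node of ranked.slice(0, 10)) {
  console.log(node.score.toFixed(6), JSON.stringify(node.address));
}
```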


@decentralion, I used https://github.com/sourcecred/research/blob/master/sample-graphs/sourcecred_sourcecred.json for the write-up. Is there a better data set that has the nodes in the format you expected (the one I used rarely has usernames in the nodes)?

I’ll amend the types to reflect the all-caps logic - thanks for explaining that!

Thanks for all the links! I’ll start working through those.

That data set has the username in every user node. However, user nodes are enormously outnumbered by other types of nodes. For example, there are several thousand comment nodes, but only about 20 user nodes. (This is also why user nodes tend to have very high degree, as evidenced by the fact that two of the three huge attractors in the graph are user nodes.)

You may want to take a look here for an example of type parsing logic, so you don’t need to invent it yourself. :slight_smile:
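In the meantime, the gist of that parsing is longest-prefix matching on the address array. Here’s a rough sketch (the prefix list is abbreviated, and this isn’t the codebase’s actual implementation):

```javascript
// Hedged sketch of type parsing: match a node address against known
// type prefixes. Components 0-1 are the plugin; components 2+ start
// with the type designators. Not the codebase's actual implementation.
const TYPE_PREFIXES = [
  ["sourcecred", "github", "USERLIKE", "USER"],
  ["sourcecred", "github", "USERLIKE", "BOT"],
  ["sourcecred", "github", "REPO"],
  // ...remaining GitHub and git prefixes go here.
];

function matchType(address) {
  // Longest prefixes first, so USERLIKE/USER wins over a bare USERLIKE.
  const sorted = TYPE_PREFIXES.slice().sort((a, b) => b.length - a.length);
  return sorted.find((p) => p.every((part, i) => address[i] === part)) || null;
}

// Example use: count nodes by type, e.g. to see how comment nodes
// outnumber user nodes.
function countByType(nodes) {
  const counts = new Map();
  for (const node of nodes) {
    const type = matchType(node.address);
    const key = type ? type.join("/") : "UNKNOWN";
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}
```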

Also, you could consider creating your fiddles inside the SourceCred codebase (where you’ll have access to all of the SourceCred infrastructure and APIs). Take a look here to get a sense for the APIs that you could be calling, rather than needing to poke around the internals of the data structure. If you want to go down this route let me know so @brianlitwin or I can help you get set up.


Oh, I see. I think the user nodes dropped out of the text analysis because they had such low counts, not because they’re absent. I’ll let David know.

EDITED: scores really helped the data viz take a more discernible shape, as seen here.

I would like to just use the available APIs if I could. Thanks for your patience as I onboard here!

Cool! Eventually, we should publish the core data structure and methods as an npm package that people can depend on and use in separate projects. For now, it would be simplest for you to do your development in a fork or branch of sourcecred/sourcecred.

As a first step, you should familiarize yourself with how the codebase works. You can check out the following video resources: our codebase walkthrough and the live coding session for the PagerankGraph class.

Afterwards, you could hack your logic into the existing prototype, and once we’ve developed some interesting apps we can work together to factor them into a clean form that we can merge into the mainline codebase. As an example, you could modify PagerankTable to render your own logic alongside the existing UI.


So, I’ve reviewed the first video and part of the second. I also looked through the GitHub code for the data viz from the hackathon (good work!). I’m wondering if we need to approach this from a couple of angles: subgraphs (tens of nodes) and the full graph (thousands of nodes).

For subgraphs, we can keep the SVG graphic, use labels, add the halo you like, and relay user interactions on the SVG back to the server as needed. I guess I’d see this functioning primarily as part of the app and secondarily as research.

For the full graph, I’m wondering if a canvas graphic would be better. Writing and running a simulation on thousands of SVG `<circle>` and `<line>` elements drags performance, and adding labels doesn’t help that problem either. The nice thing about the full graphs, though, is that they have a real topography once the simulation stabilizes.
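For concreteness, a canvas version would look roughly like this (a sketch assuming `nodes` and `links` arrays already loaded in the usual d3-force shape; styling and interaction omitted):

```javascript
// Rough sketch: draw thousands of nodes on a single <canvas> instead of
// one SVG element per node. Assumes `nodes` ({id}) and `links`
// ({source, target}) are already loaded.
import * as d3 from "d3";

const canvas = document.querySelector("canvas");
const ctx = canvas.getContext("2d");
const { width, height } = canvas;

d3.forceSimulation(nodes)
  .force("link", d3.forceLink(links).id((d) => d.id))
  .force("charge", d3.forceManyBody())
  .force("center", d3.forceCenter(width / 2, height / 2))
  .on("tick", draw);

// One repaint per tick; no per-node DOM elements to create or update.
function draw() {
  ctx.clearRect(0, 0, width, height);
  ctx.strokeStyle = "#ccc";
  for (const link of links) {
    ctx.beginPath();
    ctx.moveTo(link.source.x, link.source.y);
    ctx.lineTo(link.target.x, link.target.y);
    ctx.stroke();
  }
  for (const node of nodes) {
    ctx.beginPath();
    ctx.arc(node.x, node.y, 3, 0, 2 * Math.PI);
    ctx.fill();
  }
}
```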

Also, @decentralion, I’m not sure you really want me working with the app as described - I really don’t know anything about React. I learned d3.js so that R and Python users could access the functionality in the relevant web frameworks (R Shiny, Jupyter notebooks). I’m doing my best to learn, but it may be best to keep me focused on getting the d3 right for the time being. I think the React code structure will be easy to work with if I’m understanding it correctly - I’ll check in if/when it isn’t. There may be some good crossover in writing functions for both React in the app and Python for @mzargham’s and @davidfs’s research work. Does that sound ok? I’m open-minded on all of this, so please “right-size” my beliefs here!

First off: @ryanMorton, if you haven’t had a chance to play with the UI we made at the hackathon, check it out here. I used your work as a starting point, so you definitely have cred in that product. :slight_smile:

Personally, I’m not sure we should be trying to display the whole graph. I’m worried it will just be a noisy mess, trying to put so much information on one screen. My intention was that we would keep building up this subgraph viewer, with the assumption that at any time fewer than 100 nodes are in scope. The user will choose some nodes that they are interested in, which anchor the scope, and then we use a score-aware graph traversal algorithm to fill out the remaining nodes.

For example, suppose we are building a graph-based workflow for answering the question, “Where did this node get cred from?” Then the user chooses the node they are interested in, and we find the paths to that node that contributed the most score, e.g. using @mzargham’s algorithm here. Using that, we find the 99 most relevant nodes and display those. Then the user can double-click a new node to select it, which brings a new group of nodes into scope.
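As a sketch of the “fill out the scope” step, here’s a plain score-weighted breadth-first expansion standing in for the real path-based algorithm (the `neighbors` and `score` helpers are hypothetical):

```javascript
// Sketch: expand the scope around a selected node, preferring
// high-scoring neighbors, until we hit the display limit. A stand-in
// for the real path-based algorithm; `neighbors(addr)` and
// `score(addr)` are hypothetical helpers.
function fillScope(selected, neighbors, score, limit = 100) {
  const inScope = new Set([selected]);
  let frontier = [selected];
  while (inScope.size < limit && frontier.length > 0) {
    // Gather all not-yet-included neighbors of the current frontier.
    const candidates = [];
    for (const addr of frontier) {
      for (const next of neighbors(addr)) {
        if (!inScope.has(next)) candidates.push(next);
      }
    }
    // Admit the highest-scoring candidates first.
    candidates.sort((a, b) => score(b) - score(a));
    frontier = [];
    for (const addr of candidates) {
      if (inScope.size >= limit) break;
      if (!inScope.has(addr)) {
        inScope.add(addr);
        frontier.push(addr);
      }
    }
  }
  return inScope;
}
```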

This won’t be a super good fit for our current graph, where the most interesting nodes (users) characteristically have enormously high degree. So if I have my cred split across 900 pull requests I authored, but we’re only willing to display the top 100 nodes, then we’re going to miss a lot of the picture. We could think about doing some aggregation, like we do in the current prototype, though it’s not clear how this would work in a graph layout.

Maybe this UI will wind up being most useful for inspecting low-degree nodes (e.g. a particular pull request) but not for inspecting users. (BTW, I expect a major use case for this UI will be manually adding new nodes/edges to the graph.) This could be problematic if, as we collect more data, nodes tend to become higher and higher degree, so we may need a way to do “graph compression” or to otherwise collapse nodes/edges together. (E.g. could we imagine collapsing a pull request and all of its comments into one node, while maintaining the right cred-flow properties? cc/ @mzargham.)

Also, @mzargham and I have had some discussion in the past about finding ways to make users not so high degree; e.g. my user node has a connection to every month-long period that I was active in the project, and then those user-period nodes are connected to all the contributions from that time period.
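To make that concrete, here’s a purely illustrative sketch of splitting one user node into per-month “user-period” nodes (the contribution list and timestamp lookup are hypothetical):

```javascript
// Purely illustrative: split one high-degree user node into per-month
// "user-period" nodes. `contributions` and `timestampOf` are
// hypothetical inputs, not part of the real data model.
function splitUserByMonth(userAddress, contributions, timestampOf) {
  const periods = new Map(); // "YYYY-MM" -> [contribution addresses]
  for (const c of contributions) {
    const d = new Date(timestampOf(c));
    const month = String(d.getUTCMonth() + 1).padStart(2, "0");
    const key = `${d.getUTCFullYear()}-${month}`;
    if (!periods.has(key)) periods.set(key, []);
    periods.get(key).push(c);
  }
  // The user connects to each period node; each period node connects
  // to that month's contributions, capping the user's own degree.
  return [...periods].map(([month, cs]) => ({
    address: [...userAddress, month],
    contributions: cs,
  }));
}
```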

From an implementation standpoint: for complex visualizations like this, I believe it’s very important to have unit testing; otherwise it will become really hard to maintain as we keep adding features. We’ll also want a good API for interacting with the visualization, so that we can try to build UIs on top of the graph rather than in the graph, which implies finding a good way to fit the graph into React’s state and props abstractions. I’ll need to do some research to find a good way to do this; the prototype code from the hackathon isn’t really maintainable in that sense, and it has some gross state contamination between React and D3. Probably for now it’s best if you focus on prototyping algorithms and visualizations, and I’ll worry about productionizing them into a form that we can ship and maintain.
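One candidate pattern (a sketch, not a commitment): React owns the component lifecycle and hands D3 a ref to mutate, so the two never fight over the same DOM.

```javascript
// Sketch: keep React and D3 out of each other's state. React owns the
// lifecycle; D3 only touches the DOM node behind a ref.
import React from "react";
import * as d3 from "d3";

class GraphViz extends React.Component {
  componentDidMount() {
    this.renderGraph();
  }
  componentDidUpdate() {
    this.renderGraph();
  }
  renderGraph() {
    // D3 reads props but never stores state on React's tree.
    const svg = d3.select(this.svgNode);
    const circles = svg.selectAll("circle").data(this.props.nodes);
    circles
      .enter()
      .append("circle")
      .merge(circles)
      .attr("cx", (d) => d.x)
      .attr("cy", (d) => d.y)
      .attr("r", 3);
    circles.exit().remove();
  }
  render() {
    return <svg ref={(node) => (this.svgNode = node)} width={800} height={600} />;
  }
}
```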


Perfect, I think that’s a good plan. I’ll keep an eye on the React+d3 issue and develop a better sense of how to structure the code.

I can certainly work with fewer than 100 nodes. I do think seeing the complete picture can be more than just a fuzz-ball mess - though I suppose “fuzz ball” is probably an accurate description of some networks/projects! I’ll prioritize the lower-node-count prototype in case you do decide to aggregate and such, and I’ll work on the global view if time allows.

(the UI link is broken when I try it - the other links work though)

Oops, should be fixed now.
