Visualizing the SourceCred graph

burrrata · August 18, 2019, 7:00pm

Not sure where to put this so I’m just creating a new thread. If this is better as a comment elsewhere feel free to move it or let me know and I’ll delete and repost.

Came across this tweet visualizing the progress of the PRs on the Prysm project (Eth2.0). Something similar for SourceCred would be really cool, where someone could, if they wanted, play back the history of a project and see all the connections and how cred was earned. Obviously that’s a nice-to-have rather than a core feature, but it really does help make the data come alive in a clear and intuitive way.

So essentially, like this demo, but with a timeline feature. Maybe first displaying the simple historical chart, and then giving users the option to view it more dynamically? If that could be integrated into the homepage of the website it would make the value prop of SourceCred immediately intuitive for so many people.

If that’s too complex or requires too much computation, maybe just take snapshots at regular intervals (like this) that the user could cycle through to see progress on project development and the flow of cred/grain in response?
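A rough sketch of the snapshot idea, assuming every edge in the graph carries a creation timestamp. The `Edge` tuple and the `snapshots` helper are hypothetical names for illustration, not SourceCred APIs:

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Tuple


class Edge(NamedTuple):
    src: str
    dst: str
    created: datetime  # assumption: each edge records when it was created


def snapshots(edges: List[Edge], interval: timedelta) -> List[Tuple[datetime, List[Edge]]]:
    """Return (cutoff, edges created at or before cutoff) pairs at regular intervals."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if not edges:
        return []
    ordered = sorted(edges, key=lambda e: e.created)
    cutoff = ordered[0].created + interval
    end = ordered[-1].created
    result = []
    while True:
        # Each snapshot is cumulative: it contains everything up to the cutoff.
        result.append((cutoff, [e for e in ordered if e.created <= cutoff]))
        if cutoff >= end:
            break
        cutoff += interval
    return result
```

A front end could then cycle through the returned list and redraw the graph (or recompute cred) once per snapshot instead of animating continuously.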

Beanow · December 22, 2019, 11:54am

I think visualizing the graph would be helpful in a lot of places.

One that I’m imagining is in the new Explorer UI, showing the neighbor nodes in a graphic and being able to use them for navigation.

For ones that visualize the entire graph, I would experiment with existing tools to see how it turns out. My gut says it will be pretty tangled and have a lot of nodes, so it may not be very telling by shape. Giving the nodes sizes based on their cred score might be really interesting though.
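As a minimal sketch of sizing nodes by cred, assuming cred is available as a plain mapping from node id to score. The function name and the radius bounds are made up for illustration; the square root is one possible choice that keeps a few very large scores from dwarfing everything else:

```python
import math
from typing import Dict


def node_radii(cred: Dict[str, float], min_r: float = 3.0, max_r: float = 30.0) -> Dict[str, float]:
    """Map cred scores to circle radii between min_r and max_r."""
    top = max(cred.values(), default=0.0)
    if top <= 0:
        # No positive cred anywhere: draw every node at the minimum size.
        return {node: min_r for node in cred}
    return {
        # Radius grows with the square root of the node's share of the top score.
        node: min_r + (max_r - min_r) * math.sqrt(max(score, 0.0) / top)
        for node, score in cred.items()
    }
```

The resulting radii could be handed to whichever existing layout tool is being tried out.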

decentralion · December 22, 2019, 6:41pm

I’ve got a fair amount of experience with graph visualizers (worked alongside the team that made the TensorBoard graph visualizer). The SourceCred graph is absolutely enormous and making sense of it “raw” (or even rendering it) will be impossible. That means if we want to use a graph visualizer, we need either:

- To aggressively filter the set of nodes in scope, e.g. show only nodes 1-degree away from a target node. However, it’s easy for a user to be connected to 1k+ nodes, which means using this technique on users (the most interesting nodes) is already a non-starter.
- To find some way to “compress” the graph, extracting only salient information.
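To make the first option concrete, here is a small sketch of a 1-degree filter, assuming the graph is given as a list of `(src, dst)` pairs and treated as undirected. The `neighborhood` helper and the `max_nodes` threshold are hypothetical; the flag it returns is just a way to detect the "1k+ neighbors" problem before trying to render:

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def neighborhood(edges: Iterable[Tuple[str, str]], target: str, max_nodes: int = 200) -> Tuple[Set[str], bool]:
    """Return the nodes one hop from target, plus whether there are too many to draw."""
    adjacency = defaultdict(set)
    for src, dst in edges:
        # Assumption: direction is ignored when deciding who is a neighbor.
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    neighbors = adjacency[target] - {target}
    return neighbors, len(neighbors) > max_nodes
```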

As an example of graph “compression”: maybe we could find a way to collapse the graph down to supernodes and users, while still maintaining meaningful edges. E.g. the “Discourse Artifact” supernode may not be directly connected to @beanow, but it may have a lot of cred paths to @beanow. Need to think about the math and probably talk to @mzargham, but I suspect we could do this graph collapse by designating every supernode as a seed, seeing which users cred flows to, and then normalizing by the node’s own cred. (Might need to do this once per time interval, though.)
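A very rough sketch of that collapse, to make the idea easier to poke at. Everything here is an assumption: the graph is an unweighted adjacency dict, the PageRank is a plain power iteration rather than SourceCred's implementation, and "normalizing" is interpreted as dividing each user's score by the total that reaches users from that seed, which is only one guess at what the normalization should be:

```python
from typing import Dict, Iterable, List


def personalized_pagerank(out_edges: Dict[str, List[str]], seed: str,
                          alpha: float = 0.15, iterations: int = 100) -> Dict[str, float]:
    """Power iteration with all teleportation going to a single seed node.

    Every node must appear as a key in out_edges (possibly with an empty list).
    """
    score = {n: (1.0 if n == seed else 0.0) for n in out_edges}
    for _ in range(iterations):
        nxt = {n: (alpha if n == seed else 0.0) for n in out_edges}
        for node, targets in out_edges.items():
            # Assumption: nodes without out-edges send their mass back to the seed.
            targets = targets or [seed]
            share = (1 - alpha) * score[node] / len(targets)
            for t in targets:
                nxt[t] += share
        score = nxt
    return score


def collapse(out_edges: Dict[str, List[str]], supernodes: Iterable[str],
             users: List[str]) -> Dict[str, Dict[str, float]]:
    """For each supernode, return each user's share of the cred flowing to users."""
    result = {}
    for s in supernodes:
        scores = personalized_pagerank(out_edges, seed=s)
        total = sum(scores[u] for u in users)
        result[s] = {u: (scores[u] / total if total else 0.0) for u in users}
    return result
```

The output is a weighted supernode-to-user edge list, which is small enough to hand to an ordinary graph visualizer.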

If that technique worked and we could collapse down to a faithful supernode-and-user map, then we could drive a really interesting graph explorer that represents content at a level of abstraction that’s intelligible to users.

(Note this would only be meaningful for projects that made extensive use of the supernode system… if most of the cred was flowing based on activity, it wouldn’t produce meaningful results.)

We could use a similar technique to do a user->user reduced cred map, which would be great for discovering collaboration patterns (or cliques!).

s_ben · December 25, 2019, 4:05am

Having trouble wrapping my head around “normalizing by the node’s own cred”, but I’m new to graph theory, so I wouldn’t worry too much about that :) This does bring up an idea that’s been rattling around my brain lately though, which is normalizing by repo/maintainer. Basically, SC does a surprisingly good job “out of the box”, with default parameters. As @Beanow, myself, and now @burrrata have discovered, it’s really fun and insightful for repos you already know. However, SC is not good (in my experience) at comparing activity across repos. A mainly front-end repo with frequent changes, for instance, will generate a lot more cred than a repo containing blockchain consensus code, which hopefully doesn’t change that much, and the number of changes is not necessarily indicative of the significance/value of a change.

However, within a repo, it is fairly easy to spot the “core contributors” (i.e. full-time devs adding lots of value). This seems like an obvious point of reference, and indeed one that projects tend to focus on (e.g. “we just need more of person X!”). What if we can normalize by that?
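To illustrate what I mean, a toy sketch, assuming per-repo cred is just a mapping from user to score and that someone has hand-picked the reference contributor. The names and numbers are invented:

```python
from typing import Dict


def normalize_by_reference(cred: Dict[str, float], reference: str) -> Dict[str, float]:
    """Express everyone's cred as a multiple of a chosen core contributor's cred."""
    ref_score = cred[reference]
    if ref_score <= 0:
        raise ValueError("reference contributor must have positive cred")
    return {user: score / ref_score for user, score in cred.items()}


# Invented numbers: a busy front-end repo and a quiet consensus repo.
frontend = normalize_by_reference({"core_dev": 900.0, "newcomer": 300.0}, "core_dev")
consensus = normalize_by_reference({"core_dev": 30.0, "newcomer": 10.0}, "core_dev")
```

Both newcomers end up at one third of "a core dev", even though the raw cred differs by a factor of 30 between the two repos.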

Generalizing this point, perhaps the question is, “from the perspective of a user, what are the meaningful categorizations they typically apply?”. Off the top of my head, my first questions would be, “who is actually working on this?”, “are they meaningful to me in other contexts (e.g. am I potentially going to work with them in the future on something else)?”, “how does this relate to other initiatives, particularly those I might be paid to work on?”. It might also be cool to see some general abstract overviews, potentially beautiful ones created by artists working with data.

burrrata · December 27, 2019, 2:05am

Potentially related work:

https://sourcecred.io/odyssey-hackathon/

github.com/sourcecred/notes

Draft Research plan

opened 10:44PM - 12 Mar 19 UTC

mzargham

## Research Roadmap

1. Document the data model for the contribution graph that is being captured by the source cred team today.
2. Build on the existing model to establish a general semantic or "space of contribution graphs" which characterizes all legal contribution graphs (including accounting node types and potentially subtypes). This formal definition will serve as the domain for the Heuristics that enrich the network with weights (transition probabilities).
3. Construct one or more credit flow heuristics for every type of edge defined in the "space of contribution graphs" so that it is possible to uniquely define a view from a particular graph. Must include human readable descriptions of each heuristic's interpretation as a credit flow. The resulting matrix must be a markov chain (row stochastic matrix).
4. Construct one or more seed vector functions along with human language descriptions of the intended interpretation of driving the mixing process from such a seed.
5. Using data sets collected by the source cred project, explore the space of algorithms by prototyping in a scripting language; explore sensitivity of rankings to a variety of choices ranging from differing heuristics, to parameter sweeps of alpha and seed choices.
6. Emulate game behavior by attempting to optimize for ranking through attack vectors such as spamming events or sybil attacks.
7. Support the source cred team in implementing and testing algorithms based on this research.

![image](https://user-images.githubusercontent.com/10465438/54241313-b7957f80-44dd-11e9-9670-a612e01eede5.png)

### Lab setup:

1. get access to the graph data (sample data set is fine)
2. exploratory data analysis to better understand what that data is
3. create some synthetic graph generators so we can test the algorithms on different assumptions about user/contributor structures (including attacks, i.e. small or empty contribution spam)
4. hack together a script to go through stages like those in my multi-class page rank algorithm outline
5. clearly define some metrics/measures to evaluate properties of resulting rankings
6. automate some validation analysis for those metrics/measures

### Now the research lab is set up, algorithm research actually starts:

1. establish some hypotheses about different heuristics, properties they should have and conditions under which they are or are not effective
2. use the graph generator to explore more specific attack vectors and/or newly imagined test cases
3. Run lots of experiments and iterate toward specific algorithm constructions to determine what works best for source cred current use cases
4. provide guidelines for producing other rankings with different requirements ~ the ideal IMO is that building on this ecosystem should be possible for others without the graph theory expertise that originally deriving and testing these initial algorithms requires
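For readers unfamiliar with the "row stochastic matrix" requirement in item 3 of the roadmap, here is a minimal sketch of what it means in practice, assuming heuristic edge weights are stored as a nested dict. The `row_stochastic` helper and the self-loop treatment of nodes without outgoing weight are illustrative assumptions, not something taken from the issue:

```python
from typing import Dict


def row_stochastic(weights: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Turn non-negative edge weights into transition probabilities.

    After normalization, every row (the outgoing edges of one node) sums to 1,
    which is what makes the matrix a valid Markov chain.
    """
    matrix = {}
    for src, row in weights.items():
        total = sum(row.values())
        if total <= 0:
            # Assumption: a node with no outgoing weight keeps its mass via a self-loop.
            matrix[src] = {src: 1.0}
        else:
            matrix[src] = {dst: w / total for dst, w in row.items()}
    return matrix
```

Different heuristics would only change the input weights; the normalization step stays the same.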

github.com/sourcecred/odyssey-hackathon

Algorithms for Graph Visualization

opened 02:27PM - 13 Apr 19 UTC

mzargham

After reviewing the existing data visualizations and data models, met with @decentralion to organize the effort towards meeting graph visualization use cases during the Hackathon. As such this issue identifies use cases and specifically focuses attention on aspects to be addressed on site.

There are two use cases defined for graph visualizations:

1. Editor: as a creator and/or editor of manual nodes and edges, I wish to see the nodes I have identified, the nodes I have added, and the edges associated with those nodes.
2. Explorer: as an explorer of the SourceCred graph, I wish to select a focus and a seed, then to see a neighborhood of nodes around the focus, with the choice of nodes determined by personalized pagerank relative to the seed. Furthermore, selecting a node in the from should change the focus to the selected node, but not change the personalized pagerank scores being used to select the nodes. [yes i know this is in graph speak not user language]

Case 1: Upon review of the options, I determined that use case one is sufficiently well handled by the force directed graph layout and simple inclusion logic:

- include manual nodes added during the editor session
- include any nodes selected during the editor session
- include any edges between the included nodes
- plot via force directed layout or any built in position selector
- it may be necessary to impose a limit on the total number of selected nodes

Case 2: In light of the analysis in case 1, i decided to decompose case 2, to treat separately the choice of which nodes to visualize and the algorithm for determining their positions. Determining their positions may be reduced to the same position choices as in case 1 and for the time being will be left to built in methods. In this case the choice of nodes to plot becomes the primary algorithm of interest for the hackathon data visualization.

Proposed Algorithm for node selection in case 2 is based on a customized graph traversal approach in the spirit of @decentralion's suggestion during our meeting:

inputs:

- identifier for the focus node
- for all nodes, a mapping from identifier to cred score (personalized pagerank variation)
- the edge data for the graph (required for computing shortest paths)

pseudo code:

```
step 1. *set up*
set anchor to be node selected by user
compute pagerank according to the seed selected by the user
set the number of nodes 'K' you wish to display

step 2. *score the nodes by path*
for each node in nodes:
    path[node] = list of nodes in the shortest path from node to anchor
    cost[node] = len(path[node])
    reward[node] = sum(PR(hop) for each hop in path[node])
    value[node] = reward[node]/cost[node]

step 3. *select nodes by score*
ordered = list of nodes ordered by value
k = 0
nodes_to_plot = []
while k < K and len(ordered) > 0:
    current = ordered.pop(0)
    if cost[current] <= K-k:
        k = k + cost[current]
        for hop in path[current]:
            nodes_to_plot.append(hop)
            if hop in ordered: remove it

step 4. *conclude*
return nodes_to_plot
```

Tests:

- returned list should be length K
- all nodes returned should be valid keys in the node to cred mapping provided
- increasing K should always result in a larger total cred over all nodes included
- all nodes returned must have a path to the anchor

The proposed method is a variation (arguably a multi-layer variation) of Dijkstra's algorithm: https://hackernoon.com/how-to-implement-dijkstras-algorithm-in-javascript-abdfd1702d04

The link above is a simple overview; i'll defer on what packages might make sense to use.
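The pseudocode above can be sketched in runnable Python roughly as follows. This is one reading of the issue rather than the author's implementation, and it makes a few labeled assumptions: edges are unweighted and treated as undirected, so shortest paths come from a breadth-first search; cred is already computed and passed in as a dict; nodes are ranked by descending value; the anchor is always shown; and hops that are already selected are not counted or added twice (the pseudocode as written would append duplicates).

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple


def shortest_paths_to_anchor(edges: Iterable[Tuple[str, str]], anchor: str) -> Dict[str, List[str]]:
    """BFS from the anchor; returns node -> [node, ..., anchor] for reachable nodes."""
    adjacency: Dict[str, set] = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    paths = {anchor: [anchor]}
    queue = deque([anchor])
    while queue:
        current = queue.popleft()
        for neighbor in sorted(adjacency.get(current, ())):  # sorted for determinism
            if neighbor not in paths:
                paths[neighbor] = [neighbor] + paths[current]
                queue.append(neighbor)
    return paths


def select_nodes(edges: Iterable[Tuple[str, str]], cred: Dict[str, float],
                 anchor: str, k_max: int) -> List[str]:
    """Pick up to k_max nodes, favoring paths with high average cred per hop."""
    paths = shortest_paths_to_anchor(edges, anchor)
    # Step 2: value = total cred along the path divided by the path length.
    value = {n: sum(cred.get(h, 0.0) for h in p) / len(p) for n, p in paths.items()}
    # Step 3: greedily take the best-valued nodes whose whole path still fits.
    ordered = sorted(paths, key=lambda n: (-value[n], n))
    selected = [anchor]
    chosen = {anchor}
    for node in ordered:
        if len(selected) >= k_max:
            break
        new_hops = [h for h in paths[node] if h not in chosen]
        if len(new_hops) <= k_max - len(selected):
            selected.extend(new_hops)
            chosen.update(new_hops)
    return selected
```

Because whole paths are added at once, every selected node stays connected to the anchor, which matches the last test listed in the issue. The result can be shorter than `k_max` when fewer nodes are reachable.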

The “SourceCred Explorer” (currently out of order)

From A Gentle Introduction to Cred

If we expand a single node, we can see how that node received its cred via its connections to other nodes. At the top level, it aggregates groups of connections based on the type of edge, and the type of node the edge connects to. The percentages show what fraction of the node’s cred came from that connection, and the numbers show how much total cred came from that connection.

Then, diving down within a particular group of connections, we can see all of the individual edges along with how much cred they contributed.

If we want to learn more about a particular edge, we can expand it to see the node that edge connects to. This gives us the ability to dive into the graph from a fresh starting point. As you go “deeper” in your exploration of the graph, the color becomes deeper as well.
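For anyone who can't see the explorer while it's out of order, the top-level aggregation described above can be approximated like this. It's only a sketch: the `(edge_type, node_type, cred)` tuples are an assumed input shape and the type names in the example are invented, not the explorer's real data model.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Connection = Tuple[str, str, float]  # (edge type, neighbor node type, cred contributed)


def aggregate_connections(connections: Iterable[Connection]) -> List[Tuple[Tuple[str, str], float, float]]:
    """Group a node's connections by (edge type, node type).

    Returns rows of (group, total cred, fraction of the node's cred), largest first.
    """
    totals = defaultdict(float)
    for edge_type, node_type, amount in connections:
        totals[(edge_type, node_type)] += amount
    grand_total = sum(totals.values())
    rows = [
        (group, amount, amount / grand_total if grand_total else 0.0)
        for group, amount in totals.items()
    ]
    return sorted(rows, key=lambda row: -row[1])
```

Expanding a group would then just list the individual tuples that were summed into it.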

mzargham · February 16, 2020, 8:43am

A lot of this needs to be curated in new up-to-date issues. I think a lot of the thinking from the period you are citing remains relevant but there were simply other priorities at the time. I remain interested in the visualization work-stream but I do not have a lot of time to dedicate to it at the moment.

Probably the most help I can be is in co-mapping out an up-to-date visualization initiative. Since I am unable to champion it personally, my offer is to support someone who is interested in taking it on.
