Potentially related work:
https://sourcecred.io/odyssey-hackathon/
opened 10:44PM - 12 Mar 19 UTC
## Research Roadmap
1. Document the data model for the contribution graph that is being captured by the SourceCred team today.
2. Build on the existing model to establish a general semantics, or "space of contribution graphs", which characterizes all legal contribution graphs (including accounting node types and potentially subtypes). This formal definition will serve as the domain for the heuristics that enrich the network with weights (transition probabilities).
3. Construct one or more credit-flow heuristics for every type of edge defined in the "space of contribution graphs", so that it is possible to uniquely define a view from a particular graph. Each heuristic must include a human-readable description of its interpretation as a credit flow. The resulting matrix must be a Markov chain (row-stochastic matrix); a minimal sketch of this appears below the roadmap.
4. Construct one or more seed vector functions, along with human-language descriptions of the intended interpretation of driving the mixing process from such a seed.
5. Using data sets collected by the SourceCred project, explore the space of algorithms by prototyping in a scripting language; explore the sensitivity of rankings to a variety of choices, ranging from differing heuristics to parameter sweeps of alpha and seed choices.
6. Emulate gaming behavior by attempting to optimize for ranking through attack vectors such as event spamming or Sybil attacks.
7. Support the SourceCred team in implementing and testing algorithms based on this research.
![image](https://user-images.githubusercontent.com/10465438/54241313-b7957f80-44dd-11e9-9670-a612e01eede5.png)
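To make items 3–5 concrete, here is a minimal sketch (mine, not SourceCred's implementation) of turning heuristic edge weights into a row-stochastic transition matrix and mixing from a seed vector; the node names, weights, and the teleport parameter `alpha` are made up for illustration.

```python
import numpy as np

# Hypothetical contribution graph: raw edge weights produced by some heuristic.
# Rows = source nodes, columns = target nodes; the values are invented.
nodes = ["repo", "issue", "commit", "alice"]
W = np.array([
    [0.0, 2.0, 1.0, 0.0],   # repo   -> issue, commit
    [1.0, 0.0, 0.0, 3.0],   # issue  -> repo, alice
    [1.0, 0.0, 0.0, 4.0],   # commit -> repo, alice
    [0.0, 1.0, 2.0, 0.0],   # alice  -> issue, commit
])

# Row-normalize so each row sums to 1: the Markov-chain requirement in item 3.
P = W / W.sum(axis=1, keepdims=True)

# Seed vector (item 4): here, all mass on the "repo" node.
seed = np.array([1.0, 0.0, 0.0, 0.0])

# Personalized PageRank-style mixing (item 5); alpha here is the
# teleport-to-seed probability (conventions vary).
alpha = 0.15
scores = seed.copy()
for _ in range(1000):
    scores = alpha * seed + (1 - alpha) * scores @ P

print(dict(zip(nodes, scores.round(3))))
```

Sweeping `alpha` and swapping the seed vector in a sketch like this is exactly the kind of sensitivity exploration item 5 calls for.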
### Lab setup:
1. get access to the graph data (a sample data set is fine)
2. do exploratory data analysis to better understand what that data is
3. create some synthetic graph generators so we can test the algorithms under different assumptions about user/contributor structures, including attacks such as small or empty contribution spam (a sketch of one such generator follows this list)
4. hack together a script to go through stages like those in my multi-class PageRank algorithm outline
5. clearly define some metrics/measures to evaluate properties of the resulting rankings
6. automate some validation analysis for those metrics/measures
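A minimal sketch of item 3, assuming Python and networkx as the prototyping tools; the function names, node naming scheme, and probabilities are all invented for illustration:

```python
import random
import networkx as nx

def synthetic_contribution_graph(n_contributors=10, n_contributions=50, seed=0):
    """Toy contribution graph: contributors author contributions, and
    contributions may reference earlier contributions."""
    rng = random.Random(seed)
    g = nx.DiGraph()
    contributors = [f"user/{i}" for i in range(n_contributors)]
    g.add_nodes_from(contributors, type="contributor")
    for c in range(n_contributions):
        node = f"contribution/{c}"
        g.add_node(node, type="contribution")
        g.add_edge(node, rng.choice(contributors), type="authored_by")
        if c > 0 and rng.random() < 0.5:
            g.add_edge(node, f"contribution/{rng.randrange(c)}", type="references")
    return g

def inject_contribution_spam(g, attacker="user/spammer", n_spam=100):
    """Attack scenario: one account floods the graph with tiny,
    unreferenced contributions."""
    g.add_node(attacker, type="contributor")
    for s in range(n_spam):
        node = f"spam/{s}"
        g.add_node(node, type="contribution")
        g.add_edge(node, attacker, type="authored_by")
    return g
```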
### Now that the research lab is set up, algorithm research actually starts:
1. establish some hypotheses about different heuristics, the properties they should have, and the conditions under which they are or are not effective
2. use the graph generator to explore more specific attack vectors and/or newly imagined test cases
3. run lots of experiments and iterate toward specific algorithm constructions, to determine what works best for SourceCred's current use cases (one simple way to compare rankings across runs is sketched after this list)
4. provide guidelines for producing other rankings with different requirements ~ the ideal, IMO, is that others can build on this ecosystem without needing graph-theory expertise at the level that originally deriving and testing these initial algorithms requires
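For example, one crude sensitivity measure for the experiments in item 3 could be the overlap of the top-k nodes between two rankings; this is a placeholder of my own, not a SourceCred metric:

```python
def top_k_overlap(scores_a, scores_b, k=20):
    """Fraction of the top-k nodes shared by two rankings -- a simple measure
    of how much a parameter change (alpha, seed, heuristic) perturbs results."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return len(top_a & top_b) / k

# e.g. a value of 0.85 would mean 17 of the top 20 nodes agree across two runs
```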
opened 02:27PM - 13 Apr 19 UTC
After reviewing the existing data visualizations and data models, I met with @decentralion to organize the effort towards meeting graph-visualization use cases during the Hackathon. As such, this issue identifies the use cases and specifically focuses attention on the aspects to be addressed on site.
There are two use cases defined for graph visualizations:
1. Editor: as a creator and/or editor of manual nodes and edges, I wish to see the nodes I have identified, the nodes I have added, and the edges associated with those nodes.
2. Explorer: as an explorer of the SourceCred graph, I wish to select a focus and a seed, then to see a neighborhood of nodes around the focus, with the choice of nodes determined by personalized PageRank relative to the seed. Furthermore, selecting a node in the view should change the focus to the selected node, but not change the personalized PageRank scores being used to select the nodes. [yes, I know this is in graph speak, not user language]
Case 1:
Upon review of the options, I determined that use case 1 is sufficiently well handled by the force-directed graph layout and simple inclusion logic (a toy sketch of the inclusion logic follows the list):
- include manual nodes added during the editor session
- include any nodes selected during the editor session
- include any edges between the included nodes
- plot via force-directed layout or any built-in position selector
- it may be necessary to impose a limit on the total number of selected nodes
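A toy sketch of that inclusion logic, assuming a networkx-style graph; the function name and the node cap are placeholders of mine:

```python
def editor_view(graph, manual_nodes, selected_nodes, limit=200):
    """Case 1 inclusion logic (sketch): show manually added nodes, nodes
    selected during the editor session, and the edges among them."""
    included = set(manual_nodes) | set(selected_nodes)
    included = set(list(included)[:limit])  # crude cap on total node count
    edges = [(u, v) for (u, v) in graph.edges() if u in included and v in included]
    return included, edges
```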
Case 2:
In light of the analysis in case 1, I decided to decompose case 2, treating separately the choice of which nodes to visualize and the algorithm for determining their positions. Determining the positions reduces to the same position choices as in case 1 and will, for the time being, be left to built-in methods. The choice of which nodes to plot then becomes the primary algorithm of interest for the hackathon data visualization.
The proposed algorithm for node selection in case 2 is based on a customized graph-traversal approach, in the spirit of @decentralion's suggestion during our meeting:
inputs:
- identifier for the focus node
- a mapping from node identifier to cred score, for all nodes (personalized PageRank variation)
- the edge data for the graph (required for computing shortest paths)
pseudo code:
```
step 1. *set up*
    set anchor to be the node selected by the user
    compute pagerank according to the seed selected by the user
    set the number of nodes K you wish to display

step 2. *score the nodes by path*
    for each node in nodes:
        path[node]   = list of nodes in the shortest path from node to anchor
        cost[node]   = len(path[node])
        reward[node] = sum(PR(hop) for each hop in path[node])
        value[node]  = reward[node] / cost[node]

step 3. *select nodes by score*
    ordered = list of nodes ordered by value, highest first
    k = 0
    nodes_to_plot = []
    while k < K and len(ordered) > 0:
        current = ordered.pop(0)
        if cost[current] <= K - k:
            k = k + cost[current]
            for hop in path[current]:
                nodes_to_plot.append(hop)
                if hop in ordered: remove hop from ordered

step 4. *conclude*
    return nodes_to_plot
```
Tests:
- returned list should be length K
- all nodes returned should be valid keys in the node to cred mapping provided
- increasing the K should always result in a larger total cred over all nodes included
- all nodes returned must have a path to the anchor
The proposed method is a variation (arguably a multi-layer variation) of Dijkstra's algorithm: https://hackernoon.com/how-to-implement-dijkstras-algorithm-in-javascript-abdfd1702d04
The link above is a simple overview; I'll defer on what packages might make sense to use.
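To make the pseudocode above concrete, here is a minimal runnable sketch in Python. It is only an illustration: networkx is assumed solely for the shortest-path step, the function name and signature are mine, and it departs from the pseudocode only by skipping nodes already added (to avoid duplicates on shared path prefixes).

```python
import networkx as nx

def select_nodes_to_plot(graph, pagerank, anchor, K):
    """Sketch of the case 2 node-selection pseudocode above.
    `pagerank` maps node -> cred score (personalized PageRank variation)."""
    # steps 1-2: shortest paths from the anchor, then score each reachable node
    # (undirected view assumed; a path anchor -> node visits the same hops)
    paths = nx.single_source_shortest_path(graph, anchor)
    cost = {n: len(p) for n, p in paths.items()}
    reward = {n: sum(pagerank[hop] for hop in p) for n, p in paths.items()}
    value = {n: reward[n] / cost[n] for n in paths}

    # step 3: greedily take whole paths, highest value per hop first
    ordered = sorted(value, key=value.get, reverse=True)
    k, nodes_to_plot = 0, []
    while k < K and ordered:
        current = ordered.pop(0)
        if cost[current] <= K - k:
            k += cost[current]
            for hop in paths[current]:
                if hop not in nodes_to_plot:  # avoid re-adding shared prefix nodes
                    nodes_to_plot.append(hop)
                if hop in ordered:
                    ordered.remove(hop)

    # step 4: every returned node lies on a shortest path to the anchor
    return nodes_to_plot

# Rough check in the spirit of the tests above, on a toy graph:
# g = nx.karate_club_graph()
# pr = nx.pagerank(g, personalization={n: 1.0 if n == 0 else 0.0 for n in g})
# picked = select_nodes_to_plot(g, pr, anchor=0, K=10)
# assert all(n in pr for n in picked) and all(nx.has_path(g, n, 0) for n in picked)
```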
Working in conjunction with David Sisson and @mzargham, I've put together some early exploratory data work on the SourceCred graph data.
For the data itself, I did a short write-up here. One question David had involved the usernames (or agents) being in the edges but not the nodes. @mzargham suggested that in the current model, agents are just users, but we may explore the question further. From a charting perspective, agent nodes would likely "pull" a great deal of other nodes into their orbit, …
From *A Gentle Introduction to Cred*:
If we expand a single node, we can see how that node received its cred via its connections to other nodes. At the top level, it aggregates groups of connections based on the type of edge, and the type of node the edge connects to. The percentages show what fraction of the node’s cred came from that connection, and the numbers show how much total cred came from that connection.
Then, diving down within a particular group of connections, we can see all of the individual edges along with how much cred they contributed.
If we want to learn more about a particular edge, we can expand it to see the node that edge connects to. This gives us the ability to dive into the graph from a fresh starting point. As you go “deeper” in your exploration of the graph, the color becomes deeper as well.
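As a toy illustration of that per-connection decomposition, a sketch of grouping a node's incoming cred by connection type and reporting each group's share; the tuple format here is invented for the example, not SourceCred's actual data model:

```python
from collections import defaultdict

def cred_breakdown(incoming):
    """Group a node's incoming cred by (edge type, neighbor node type) and report
    each group's share of the total, mirroring the expandable view described above.
    `incoming` is a list of (edge_type, neighbor_type, cred) tuples (invented format)."""
    totals = defaultdict(float)
    for edge_type, neighbor_type, cred in incoming:
        totals[(edge_type, neighbor_type)] += cred
    grand_total = sum(totals.values()) or 1.0
    return {group: (cred, cred / grand_total) for group, cred in totals.items()}
```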
Cred Analysis Notebooks
Status: help wanted
Occasionally new types of Notebooks are created, but there’s room for a lot more.
Additionally it’s a great starting point for new contributors.
Please give a shoutout if you’d like to help.
Champion:
None atm.
It would be very helpful if a Champion could oversee the quality, organization, and prioritized wishlist of Notebooks.
Initiative Description:
Observable Notebooks have proven to be extremely useful tools for building new interfaces, experiences, and…