Data model for SourceCred

I’ve been familiarizing myself with W3C’s PROV data model, http://www.w3.org/TR/2013/NOTE-prov-primer-20130430/. The work is about six years old, but it seems like it may be useful in coming up with a canonical data model for SourceCred. There’s even a Git2PROV implementation, https://github.com/IDLabResearch/Git2PROV.

Is there any history within the SourceCred community with PROV?

Interesting. I don’t think this has been discussed in the SourceCred community before (unless you or @mzargham have brought it up). I do note that the Entity/Agent/Activity system maps onto the ‘Author/Content/Event’ framing proposed in ‘Multiclass Pagerank’.
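
To make that mapping concrete, here is a rough sketch. The pairing (Agent to author, Entity to content, Activity to event) is my reading of the two framings, and the enum names and node ids are made up for illustration; none of this is existing SourceCred code.

```python
from enum import Enum


class ProvType(Enum):
    """The three core PROV classes."""
    ENTITY = "prov:Entity"
    AGENT = "prov:Agent"
    ACTIVITY = "prov:Activity"


class CredType(Enum):
    """Node classes from the author/content/event framing (names are hypothetical)."""
    AUTHOR = "author"
    CONTENT = "content"
    EVENT = "event"


# Assumed pairing between the two vocabularies.
PROV_TO_CRED = {
    ProvType.AGENT: CredType.AUTHOR,
    ProvType.ENTITY: CredType.CONTENT,
    ProvType.ACTIVITY: CredType.EVENT,
}


def classify(nodes):
    """Relabel a {node_id: ProvType} mapping with author/content/event classes."""
    return {node_id: PROV_TO_CRED[prov_type] for node_id, prov_type in nodes.items()}


# Hypothetical node ids, purely for illustration.
nodes = {
    "alice": ProvType.AGENT,
    "README.md@v2": ProvType.ENTITY,
    "commit-1": ProvType.ACTIVITY,
}
cred_nodes = classify(nodes)
```

If the pairing holds, relabeling nodes is the easy part; the open question is how the edges carry over.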

This is cool. I had not extended the thinking this far but I am a proponent of having an underlying formalism for interpreting graphs as event sequences.

Thought 1: This fits very well with the general framework posited in the Multiclass Pagerank doc.

Thought 2: The SourceCred graph does not currently have the structure that maps to this but it is also not that far away.

Thought 3: Further formalizing these types of relations, which provide implicit temporal relations via events (activities), could help resolve the effect of time without explicitly using time.

Thought 4: This does not handle the problem of agents/authors having extremely high degree, but I am wondering if we can explore down this path to find a remedy for that concern.

To @decentralion’s point, this hasn’t been discussed widely yet; it’s the ongoing research thread of @davidfs. I think now that he has had a chance to review the related literature and to posit a relation to SourceCred, it would be a good time to start iterating with the community. This thread is as good a place as any to start.

SourceCred addresses a subset of provenance, focusing on lineage and attribution. As described by the PROV Working Group within W3C:

Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. The PROV Family of Documents defines a model, corresponding serializations and other supporting definitions to enable the inter-operable interchange of provenance information in heterogeneous environments such as the Web.

I ran across PROV a while back while doing research for a discussion with @mzargham, a discussion in which SourceCred’s author/content/event substrate stood out pretty plainly. The parallels between PROV and that substrate are just as plain, as @decentralion noted. Something like PROV could be the means of creating a rich, graph-based model of provenance for an arbitrary decentralized project. This provenance graph could be transformed into a canonical contribution graph along the lines proposed in ‘Multiclass Pagerank’.

One particularly interesting part of the PROV family, PROV-O, has a concept of “qualified terms”. Qualified terms provide a means to add attributes describing the influence between resources, with specific sets of attributes relevant to different types of decentralized projects. Each type of decentralized project would have a canonical provenance graph. SourceCred would have a canonical contribution graph. Creating a project-type-specific provenance graph and transforming it to a canonical contribution graph would be what plug-ins do. In this way, plug-in architecture and design can be standardized across arbitrary project types.

The following illustrates this idea (taken from PROV-O: The PROV Ontology). The contribution of one resource to another (black, dotted arrows) can be delineated by or inferred from the underlying provenance (blue, solid arrows). With sufficient detail, contribution weights might be modeled from attributes in the qualified terms of the provenance graph.
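
As a rough sketch of what such a transformation could look like: the snippet below infers agent-to-entity contribution edges from two kinds of provenance facts, generation (`prov:wasGeneratedBy`) and qualified association (`prov:qualifiedAssociation` with a `prov:hadRole`). The inference rule itself, the role names, the role weights, and the node ids are all assumptions for illustration, not part of PROV or of SourceCred.

```python
from collections import defaultdict

# (entity, activity) pairs, read as prov:wasGeneratedBy.
was_generated_by = [
    ("README.md@v2", "commit-1"),
    ("parser.py@v1", "commit-2"),
]

# (activity, agent, role) triples, modeled on prov:qualifiedAssociation
# with prov:hadRole. Role names are hypothetical.
qualified_associations = [
    ("commit-1", "alice", "author"),
    ("commit-2", "alice", "author"),
    ("commit-2", "bob", "reviewer"),
]

# Hypothetical weights per role; a real plug-in would have to justify these.
ROLE_WEIGHTS = {"author": 1.0, "reviewer": 0.5}


def contribution_edges(generated, associations, role_weights, default=0.0):
    """Infer weighted (agent, entity) contribution edges from provenance.

    Assumed rule: if an entity was generated by an activity, every agent
    associated with that activity contributed to the entity, weighted by
    the role recorded on the qualified association.
    """
    agents_by_activity = defaultdict(list)
    for activity, agent, role in associations:
        agents_by_activity[activity].append((agent, role_weights.get(role, default)))

    edges = defaultdict(float)
    for entity, activity in generated:
        for agent, weight in agents_by_activity.get(activity, []):
            edges[(agent, entity)] += weight
    return dict(edges)


edges = contribution_edges(was_generated_by, qualified_associations, ROLE_WEIGHTS)
for (agent, entity), weight in sorted(edges.items()):
    print(f"{agent} -> {entity}: {weight}")
```

The provenance facts stay project-specific (here, Git-like commits), while the output is a plain weighted edge list that a canonical contribution graph could consume.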