Description
Contributors: Paul Groth, Mike Taylor
Goals and Summary
Many datasets are the product of aggregating data from multiple sources, often maintained by multiple parties. For example, the Open PHACTS platform (http://dev.openphacts.org) provides integrated access to over 10 different databases. These databases, for example Chembl and Uniprot, are themselves amalgamations of data extracted both automatically and via human curation from other sources such as the literature. They in turn rely on data models and ontologies developed by still others, for example the GO or Chebi ontologies. Additionally, integrators may slightly change datasets through format changes or the addition of links.
- Issue 1: The provenance or credit chain of a single answer given by a data integration platform can be much bigger than the answer itself. How do we correctly ensure credit is given to all actors in the system? Furthermore, how do we ensure that these chains can be effectively traced back?
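To make the size of such a credit chain concrete, here is a minimal sketch that walks a "derived from" graph and lists every upstream source behind one answer. The dataset names come from the text above, but the edges, the `DERIVED_FROM` mapping and the `credit_chain` function are all hypothetical and only illustrate the idea; they do not describe how any of these databases are actually built.

```python
from collections import deque

# Hypothetical "derived from" edges: each item maps to the sources it draws on.
# The edges are illustrative only, not a statement about the real databases.
DERIVED_FROM = {
    "Open PHACTS answer": ["Chembl", "Uniprot"],
    "Chembl": ["literature", "Chebi"],
    "Uniprot": ["literature", "GO"],
}


def credit_chain(start, derived_from):
    """Return every upstream source reachable from `start`, breadth-first,
    listing each source once even if it is reached by several paths."""
    seen = []
    queue = deque(derived_from.get(start, []))
    while queue:
        source = queue.popleft()
        if source in seen:
            continue  # already credited via another path
        seen.append(source)
        queue.extend(derived_from.get(source, []))
    return seen


chain = credit_chain("Open PHACTS answer", DERIVED_FROM)
print(chain)
```

Even in this toy graph a single answer carries five sources to credit, which is the imbalance Issue 1 describes.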
An additional aspect of such integrations is that they are all in constant flux. Uniprot is released every 4 weeks, Chembl is released quarterly, and some sources, such as SureChembl, are updated hourly. How does a data integrator appropriately capture and expose this information? Currently, this is often done by providing versioned data dumps. However, this may not be allowed or supported in many cases due to licensing, technical or policy issues.
- Issue 2: How do we cite evolving data sets that originate from multiple sources?
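One possible shape for such a citation, sketched under assumptions: the integrator records, for each source, which release it loaded and when, and renders that into a citation string. The `SourceSnapshot` class, the `cite` function, the release labels and the dates below are all hypothetical placeholders, not an agreed format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourceSnapshot:
    """One source as it was when the integrator loaded it (hypothetical model)."""
    name: str
    release: str     # placeholder release label, not a real version number
    retrieved: date  # day the integrator fetched this release


def cite(platform, snapshots):
    """Render a citation naming the platform and each source snapshot,
    sorted by source name so the output is stable."""
    parts = [
        f"{s.name} (release {s.release}, retrieved {s.retrieved.isoformat()})"
        for s in sorted(snapshots, key=lambda s: s.name)
    ]
    return f"{platform}, integrating: " + "; ".join(parts)


citation = cite(
    "Open PHACTS",
    [
        SourceSnapshot("Uniprot", "R-example", date(2014, 1, 15)),
        SourceSnapshot("Chembl", "Q-example", date(2014, 1, 10)),
    ],
)
print(citation)
```

This pins the time dimension of each source without requiring the integrator to redistribute a versioned dump.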
Why is it important and to whom?
- Downstream data providers need to be given appropriate credit.
- Usage is a key metric for continued funding of these databases.
- Funders would like to know what data sources are being effectively used.
- Curators need to be credited for the important work that they do.
- Users would like to know best practice in terms of citation. The "how to cite us" pages vary across data sets. Furthermore, it's difficult to cite specific time dimensions.
- Encourages the fundamental work of data integration.
Why hasn’t it been solved yet?
- No current agreement on what data integrators should supply in terms of citation
- Need to be able to expand the entire provenance trace easily, which requires agreement across the data supply chain. While standards such as W3C PROV exist, they are still not widely used.
- Inability to appropriately "ping back" or notify upstream data providers about usage
- Need to combine data and software citation