Question 3: What are the emerging tools, standards and infrastructures?
The new paradigm for interoperability on the web, and for building the basic layer of a semantic web, is the concept of Linked Open Data [1] (LOD).
Instead of pursuing ad hoc solutions for the exchange of specific data sets, linked open data makes it possible to express structured data so that it can be linked to other data sets that follow the same principles. Examples of extensive use of linked open data technologies include The New York Times (NYT) and the BBC news service. Some governments are also pushing hard to publish administrative information as LOD.
The Linking Open Data cloud diagram
The technology of LOD is based on W3C standards such as the "Resource Description Framework" [2] (RDF), which facilitates the exchange of structured information regardless of the structure in which it is expressed at the source. Any database can readily be expressed in RDF, as can structured textual information from content management systems. Presenting data in RDF makes it understandable and processable by machines, which can then mash up data from different sites. Mainstream open source data management tools such as Drupal and Fedora Commons already include RDF as a way to present data.
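To make the idea of "expressing structured data in RDF" concrete, here is a minimal sketch that writes one record as RDF triples in N-Triples syntax. It uses only the Python standard library. The `example.org` namespace, the record itself and the helper names are assumptions made for illustration; the Dublin Core terms namespace is a real vocabulary, but this is not how any particular tool mentioned above does it.

```python
# Minimal sketch: serialise one record as RDF triples (N-Triples syntax).
# BASE and the sample record are hypothetical; DCTERMS is the Dublin Core
# terms namespace.

DCTERMS = "http://purl.org/dc/terms/"
BASE = "http://example.org/publication/"  # hypothetical namespace


def escape_literal(value):
    """Escape characters that may not appear raw inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(record_id, fields):
    """Return one N-Triples line per field of the record."""
    subject = f"<{BASE}{record_id}>"
    lines = []
    for name, value in fields.items():
        predicate = f"<{DCTERMS}{name}>"
        if value.startswith(("http://", "https://")):
            obj = f"<{value}>"  # a link to another resource
        else:
            obj = f'"{escape_literal(value)}"'  # a plain text value
        lines.append(f"{subject} {predicate} {obj} .")
    return lines


record = {
    "title": "Rice yields in irrigated systems",
    "subject": "http://example.org/concept/rice",  # hypothetical concept URI
}
triples = record_to_ntriples("42", record)
for line in triples:
    print(line)
```

The second triple is the interesting one: because its object is a URI rather than a string, another site that uses the same URI is automatically talking about the same thing.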
Within the area of agricultural research for development an infrastructure to facilitate the production of linked open data is needed. The four key elements to make this possible are:
a registry of services and data sets (CIARD RING, http://www.ring.ciard.net);
common vocabularies to facilitate automatic data linking (thesauri, authority files, value vocabularies);
technology (content management systems, RDF wrappers for legacy systems);
training and capacity development.
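The role of common vocabularies in automatic linking can be sketched in a few lines: when two independent data sets tag their records with the same concept URI, the link between them falls out of a simple grouping step. All URIs and data set contents below are hypothetical stand-ins for entries in a shared thesaurus.

```python
from collections import defaultdict

# Hypothetical concept URIs standing in for entries of a shared thesaurus.
RICE = "http://example.org/concept/rice"
MAIZE = "http://example.org/concept/maize"

# Two independent data sets, each a list of (resource URI, concept URI) pairs.
dataset_a = [
    ("http://example.org/a/report/1", RICE),
    ("http://example.org/a/report/2", MAIZE),
]
dataset_b = [
    ("http://example.org/b/trial/9", RICE),
]


def link_by_concept(*datasets):
    """Group resources from all data sets by the concept URI they share."""
    by_concept = defaultdict(list)
    for dataset in datasets:
        for resource, concept in dataset:
            by_concept[concept].append(resource)
    # Keep only concepts that connect more than one resource.
    return {c: r for c, r in by_concept.items() if len(r) > 1}


links = link_by_concept(dataset_a, dataset_b)
```

Without a common vocabulary the two data sets would use different strings for the same crop, and this grouping would find nothing.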
[1] Linked Data - Connect Distributed Data across the Web, http://linkeddata.org/ (last accessed March 2011)
[2] Resource Description Framework, http://www.w3.org/RDF/ (last accessed March 2011)
Data citation
One way to encourage the sharing of data is to develop the practice of data citation.
Here are two useful background documents, one from the Australian National Data Service (ANDS) and the other from Gen2Phen, an EU project focusing on health and life science research data.
- Data Citation Awareness http://ands.org.au/guides/data-citation-awareness.html
- D9.3 Draft Report on Incentives and Rewards in the Field of Biomedical Research Databases http://www.gen2phen.org/system/files/private/D9.3%20Draft%20Report%20on%...
Standards and tools to transform data from SQL to RDF
I recommend that RDF beginners start with tools that implement a Direct Mapping from a database. A Direct Mapping mirrors the database schema in RDF and requires minimal effort to implement. There have also been efforts to let users annotate the SQL code to provide the same capability, e.g. the work done by the FlyWeb project:
- My first mapping from RDB to RDF using a direct mapping http://ivan-herman.name/2010/11/19/my-first-mapping-from-direct-mapping/
- Future of FlyWeb work on Chado OWL ontology and RDF mapping http://generic-model-organism-system-database.450254.n5.nabble.com/Futur...
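As a rough illustration of what a Direct Mapping does, the sketch below turns every row of a table into a resource, every column into a predicate and every non-NULL cell into a literal. It is a simplified sketch in the spirit of a direct mapping, not an implementation of any specific tool or of the full W3C proposal: the base URI, the `crop` table and the function names are assumptions, foreign keys are ignored, and values are not percent-encoded in row URIs.

```python
import sqlite3

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
BASE = "http://example.org/db/"  # hypothetical base URI for the database


def literal(value):
    """Render a cell value as an N-Triples literal."""
    if isinstance(value, int):
        return f'"{value}"^^<{XSD_INTEGER}>'
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{text}"'


def direct_map(conn, table, key):
    """Mirror one table in RDF: row -> subject, column -> predicate."""
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key}"')
    columns = [d[0] for d in cursor.description]
    triples = []
    for row in cursor:
        values = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{key}={values[key]}>"
        # Each row is typed with a class named after its table.
        triples.append(f"{subject} <{RDF_TYPE}> <{BASE}{table}> .")
        for column in columns:
            if values[column] is None:
                continue  # NULL cells produce no triple
            predicate = f"<{BASE}{table}#{column}>"
            triples.append(f"{subject} {predicate} {literal(values[column])} .")
    return triples


# Hypothetical in-memory database with a single table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crop (id INTEGER PRIMARY KEY, name TEXT, notes TEXT)")
conn.execute("INSERT INTO crop VALUES (1, 'maize', NULL)")
triples = direct_map(conn, "crop", "id")
for line in triples:
    print(line)
```

Nothing here requires knowledge of the domain: the RDF is derived mechanically from the schema, which is why this approach is a gentle starting point before moving to hand-written mappings.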
These tools are not only of interest to users who want to transform an existing database. MOLGENIS http://sourceforge.net/projects/molgenis/ is a tool that allows users to create their own TAB-delimited format to record science data and then move it into RDF (using D2RQ) so that other applications can process it. The starting point is a couple of XML files (one for the data model and one for the UI) with a simple syntax; a utility can extract the data model file from an existing database. From these original "models", the MOLGENIS project aims to derive a range of tools, including an R API and RDF access using D2RQ, in a way comparable to what has been done in the FlyWeb project.
LOD and ontology
The gap between "lightweight semantics" like LOD and ontology-based approaches is much smaller than it used to be. The datatype reasoning capabilities enabled by the OWL2 standard http://www.w3.org/TR/owl2-overview/ and the new features provided by ontology tools such as SPARQL-DL http://www.w3.org/2001/sw/wiki/SPARQL-DL can help LOD users exploit ontology content even when it is mixed with numerical data.
Graph databases (NoSQL)
Finally, graph databases may also have a role to play in future Linked Open Data infrastructures: new (and also old) products are now competing for a market niche that sits roughly halfway between traditional databases and triple stores.
- Sandro Hawke, "Toward Standards for NoSQL", NoSQL Live … from Boston, March 11, 2010 http://www.w3.org/2010/Talks/0311-nosql/talk.pdf