Recent discussions concerning Google and the BNF have foregrounded digital issues. But that debate may be only part of the background to current developments on the web and the Internet. The BNF was already putting documents online at the end of 1997, while Page and Brin did not found their little Google “start-up” until 1998; ten years later, Google is valued at 210 billion US dollars. But that is not what counts. Created at CERN by Tim Berners-Lee in 1990, the World Wide Web is a way of organizing information hypertextually, making it possible to link and discover documents across multiple systems and platforms in a universal system. It is also worth remembering that the HTTP protocol was developed in response to scientific needs and subsequently generalized to encompass all human activity. In the same way, the evolution of the “web of documents” into collaborative Web 2.0 applications in the early 2000s was driven by scientific applications and collaborative uses: the establishment of virtual communities or invisible colleges, networks of active participants at the forefront of collaboration, before being extended and generalized into all spheres of society, communication and leisure.
The challenge is now to build the Data Web. Richer and more complex than previous forms of the web, it represents an unprecedented advance, both quantitatively and qualitatively, with socio-economic consequences as yet untold. At this moment, a data revolution is taking place in the world of scientific knowledge. Researchers, students, and teachers are eager for digital documents of all kinds (texts, databases, images, maps...), and bibliographic applications are converging accordingly. But HTML (even extended as XML) cannot overcome the “document barrier” inherited from centuries of analogue data: relative to the communicative potential of the web, data remains locked away in relational databases. Underlying the creation of knowledge is an exponential increase in data creation: sensors, probes, simulators, cameras, automated surveillance systems, and computational systems are accumulating data uncontrollably, beyond the scale of human understanding. In the social sciences, bits of information are produced and exchanged by the subjects of study themselves, breaking down the frontier between private and public. Virtual communities are engaged in unlimited processes of enrichment and large-scale data analysis (in the genetic and medical sciences, in climate studies, ecology, economics, etc.). Finally, there are objects and systems which store information (including details of daily life) and are able to mass-produce and mass-exchange data (RFID).
Data goes public
The Data Web, also called the Semantic Web by Tim Berners-Lee, creates data independence, in a kind of worldwide digital heaven. Data becomes linked data by means of its expression in RDF (Resource Description Framework, a W3C standard), in the form of a triple (subject, predicate, object): a kind of logical “syntax” which makes it possible to link data together at the heart of the web, readable by computer and independent of its originating site. That assumes the existence of public distribution policies “opening up” databases and making them available to new query engines (via the SPARQL query language), in the way that the Obama administration’s data.gov program is being undertaken in the USA. Wikipedia already supplies its data in RDF (DBpedia), thus making available millions of links for its dataset, a pre-defined collection of reference and thesaurus data. “Raw” data can be virtually linked to produce massive graphic visualisations. Google is also taking this route of annotating HTML pages with RDF. Tim Berners-Lee is energetically campaigning for the availability of independent, autonomous data (“Raw data now”) in order to develop the web of the future. For example, a researcher will be able to combine epidemiological data on the one hand with socio-economic data on the other, creating a new research field independent of the originating web sites. “Liberated” datasets, open to evaluation on this new web: this is the new knowledge economy.
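The triple model, and the kind of cross-site combination it enables, can be illustrated without any RDF tooling. The sketch below is plain Python; the URIs and predicate names are invented for the example. It represents two independently published datasets as (subject, predicate, object) triples and joins them on a shared subject URI, in the spirit of the epidemiology-plus-socio-economics scenario described above.

```python
# A minimal, illustrative sketch of linked data as (subject, predicate, object)
# triples. All URIs and predicate names below are hypothetical examples.

# An "epidemiological" dataset, published by one (imaginary) site.
epidemiology = [
    ("http://example.org/region/75", "ex:fluCasesPer100k", 320),
    ("http://example.org/region/13", "ex:fluCasesPer100k", 150),
]

# A "socio-economic" dataset, published independently by another site,
# but using the same subject URIs to identify the same regions.
socio_economic = [
    ("http://example.org/region/75", "ex:medianIncomeEUR", 26000),
    ("http://example.org/region/13", "ex:medianIncomeEUR", 21000),
]

def join_on_subject(triples_a, triples_b):
    """Combine two triple sets on shared subject URIs.

    This is the essence of linking data across sites: once both datasets
    are expressed as triples about common, globally identified subjects,
    they can be merged without either site knowing about the other.
    """
    b_index = {s: (p, o) for (s, p, o) in triples_b}
    combined = []
    for s, p, o in triples_a:
        if s in b_index:
            p2, o2 = b_index[s]
            combined.append({"subject": s, p: o, p2: o2})
    return combined

for row in join_on_subject(epidemiology, socio_economic):
    print(row)
```

A real deployment would use an RDF library and a SPARQL endpoint rather than in-memory tuples, but the principle is the same: shared identifiers plus the triple form make datasets combinable independently of their originating sites.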
Data and the Publishing Revolution
The Data Web subverts the “peer-review” model of scientific communication jointly created in 1665 by the Journal des Sçavans in France and the Philosophical Transactions in England. Scientific knowledge is evaluated less by its “results” than by the way its data has been processed for dynamic presentation, or by simulation of its processing. Already, open access models, in which publications are valued for their contribution to an “upstream” availability of data rather than for their “downstream” marketability, are breaking down traditional publishing business models. Major publishers such as Elsevier have understood this in proposing data-centric publishing processes (the “Article of the Future”, a form of article made multi-functional by its associated data). In the same way, Thomson Reuters has launched an application suite for semantic link management called OpenCalais. Robert Darnton announced this revolution in publishing a decade ago: the online publishing of documents linked to their sources, with critical annotation. That revolution is now possible, and underway, thanks to the Data Web.
Infrastructures for the Data Web
Of course, these three stages of the Web are not mutually exclusive. They come about in successive waves: textual and alphanumeric datasets flowing into digital libraries (the digital edition), and then into the heart of scientific work centred on data, collaborative and generalized on a planetary scale. The conversion of scientific databases into the Data Web is at the heart of the development of digital infrastructures for the sciences: publishing platforms, Web 2.0 collaborations (blogs, wikis, etc.) and processing environments (portals, processors, relational databases). These must be platforms where services can be integrated, supporting convergence of the roles of different service providers on the one hand, and interconnecting heterogeneous laboratory data aggregated from distributed sites on the other. Like the US, France and Europe are engaged in major “road maps” (ESFRI) to prepare large digital infrastructures. Let us hope that at this strategic moment in the development of the Data Web, the implementation of these infrastructures will be taken into account, in particular in the evaluation of future national funding plans and investments.
Yannick Maignien, TGE Adonis, November 2009.
Photograph under CC licence by Caveman 92223/Flickr