Guest post by Lin Clark and Michael Hausenblas, DERI
In this, the Petabyte Age, technologists have a growing obsession with data—Big data. But data isn’t just the province of trained specialists anymore. Data is changing the way scientists research and the way that journalists investigate; the way government officials report their progress and the way citizens participate in their own governance.
The challenge that all of these accidental technologists face is how to surface data and bring data together in meaningful ways. As Google’s chief economist Hal Varian has said, the scarce factor is no longer the data, which is essentially free and ubiquitous, but now the “scarce factor is the ability to understand that data and extract value from it.”
The emerging Web of Linked Data is the largest source of this data (multi-domain, real-world and real-time data) that currently exists. As data integration and information quality assessment increasingly depend on the availability of large amounts of real-world data, these new technologists are going to need to find ways to connect to the Linked Open Data (LOD) cloud.
With the explosive growth of the LOD cloud, which has doubled in size every 10 months since 2007, utilising this global data space in a real-world setup has proved challenging: the links between LOD sources remain limited in both number and quality, and there is no well-documented and cohesive set of tools that enables individuals and organisations to easily produce and consume Linked Data.
A new project aims to change this, making it easier to connect to the LOD cloud by offering support to data owners, Web developers who build applications with Linked Data, and small and medium enterprises that want to benefit from the lightweight data integration possibilities of Linked Data.
LATC to the Rescue
The new LOD Around-the-Clock (LATC) project kicked off on September 13-14, 2010 at the Digital Enterprise Research Institute in Galway, Ireland. LATC brings together a team of Linked Data researchers and practitioners from DERI (National University of Ireland Galway), Vrije Universiteit Amsterdam, Freie Universität Berlin, Institut für Angewandte Informatik, and Talis.
This team will support the production and consumption of Linked Data by providing:
- A recommended tools library for publishing and consuming Linked Data, supplementary documentation for the tools, and free implementation support for large-scale data publishers and consumers. Tools include the D2R Server for publishing relational databases on the Semantic Web, the Drupal CMS and related publishing and consumption tools, and others.
- A 24/7 interlinking platform (see Fig. 1) that acquires new data and creates links between existing datasets in the LOD cloud.
- Publication of new large-scale LOD datasets with data from governmental departments and other organizations. The focus will be on EU level datasets such as CORDIS, the European Patent Office, and Eurostat.
Total cost: 1.19 M€
EU contribution: 1.06 M€
Dr. Michael Hausenblas
IDA Business Park, Galway, Ireland
Tel. +353 91 495730
In addition to the core team, a large Advisory Committee with more than 30 members will participate in the LATC activities and connect the Linked Data community to LATC’s recommended tools library and support services. Organizations on the Advisory Committee are entitled to support from the project and thus will be in a position to give feedback to improve the support services. The Advisory Committee includes governmental organisations such as the UK Office of Public Sector Information and the European Environment Agency; researchers and practitioners such as the University of Manchester, University of Economics Prague, Vulcan Inc., CTIC Technological Center, the Open Knowledge Foundation; and standardisation bodies, including W3C (Tim Berners-Lee). The LATC partners will also liaise with other EC projects and related activities, including LOD2, PlanetData, SEALS, datalift.org, Semic.EU, OKFN, and the Pedantic Web group.
LATC organises and supports a number of community events, including tutorials at the International Semantic Web Conference 2010 in Shanghai, China, as well as the Open Government Data Camp, London.
LATC is a Support Action funded under the European Commission FP7 ICT Work Programme, within the Intelligent Information Management objective (ICT-2009.4.3).
“You’ll end up writing a database” said Dan Brickley prophetically in early 2000. He was, of course, correct. What started as an RDF/XML parser and a BerkeleyDB-based triple store and API ended up as a much more complex system that I named Redland with the librdf API. It does indeed have persistence, transactions (when using a relational database) and querying. However, RDF query is not quite the same thing as SQL since the data model is schemaless and graph centric, so when RDQL and later SPARQL came along, Redland gained a query engine component in 2003 named Rasqal: the RDF Syntax and Query Library for Redland. I still consider it not a 1.0 library after over 7 years of work.
Query Engine The First
The first query engine was written to execute RDQL which today looks like a relatively simple query language. There is one type of SELECT query returning sequences of sets of variable bindings in a tabular result like SQL. The query is a fixed pattern and doesn’t allow any optional, union or conditional pattern matching. This was relatively easy to implement in what I’ve called a static execution model:
- Break the query up into a sequence of triple patterns: triples that can include variables in any position which will be found by matching against triples. A triple pattern returns a sequence of sets of variable bindings.
- Match each of the triple patterns in order, top to bottom, to bind the variables.
- If there is a query condition like ?var > 10 then check that it evaluates true.
- Return the result.
- Repeat at step #2.
The only state that needed saving was how far through the sequence of triple patterns the execution had got – pretty much an integer, so that the looping could continue. When a particular triple pattern was exhausted it was reset, the previous one incremented and the execution continued.
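The static model described above can be sketched in a few lines. This is an illustration only, not Rasqal's C implementation: the names `match_pattern` and `execute` are hypothetical, variables are strings starting with `?`, and a Python generator stands in for the saved integer position so that work is only done when the caller asks for the next result.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]
Bindings = Dict[str, str]


def match_pattern(pattern: Triple, triple: Triple,
                  bindings: Bindings) -> Optional[Bindings]:
    """Extend bindings so that pattern matches triple, or return None."""
    new = dict(bindings)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # A variable: bind it, or check it agrees with an earlier binding.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            # A constant term that does not match.
            return None
    return new


def execute(patterns: List[Triple], graph: List[Triple],
            condition: Callable[[Bindings], bool] = lambda b: True
            ) -> Iterator[Bindings]:
    """Match the triple patterns in order, top to bottom, lazily."""
    def step(i: int, bindings: Bindings) -> Iterator[Bindings]:
        if i == len(patterns):
            # All patterns matched: apply the query condition, if any.
            if condition(bindings):
                yield bindings
            return
        for triple in graph:
            extended = match_pattern(patterns[i], triple, bindings)
            if extended is not None:
                # When the inner pattern is exhausted, control returns
                # here and the current pattern moves to its next triple.
                yield from step(i + 1, extended)
    return step(0, {})
```

Calling `next()` on the returned iterator corresponds to asking for one more result; nothing is matched until then.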
This worked well and executed all of RDQL without problems. In particular it was a lazy execution model – it only did work when the application asked for an additional result. However, in 2004 RDF query standardisation started and the language grew.
Enter The Sparkle
The new standard RDF query language which was named SPARQL had many additions to the static patterns of the RDQL model, in particular it added OPTIONAL which allowed optionally (sic) matching an inner set of triple patterns (a graph pattern) and binding more variables. This is useful in querying heterogeneous data when there are sometimes useful bits of data that can be returned but not every graph has it.
This meant that the engine had to be able to match multiple graph patterns – the outer one and any inner optional graph pattern – as well as be able to reset execution of graph patterns, when optionals were retried. Optionals could also be nested to an arbitrary depth.
This combination meant that the state that had to be preserved for getting the next result became a lot more complex than an integer. Query engine #1 was updated to handle 1 level of nesting and a combination of outer fixed graph pattern plus one optional graph pattern. This mostly worked but it was clear that having the entire query have a fixed state model was not going to work when the query was getting more complex and dynamic. So query engine #1 could not handle the full SPARQL Optional model and would never implement Union which required more state tracking.
This meant that Query Engine #1 (QE1) needed replacing.
Query Engine The Second
The first step was a lot of refactoring. In QE1 there was a lot of shared state that needed pulling apart: the query itself (graph patterns, conditions, the result of the parse tree), the engine that executed it and the query result (sequence of rows of variable bindings). That needed pulling apart so that the query engine could be changed independent of the query or results.
Rasqal 0.9.15 at the end of 2007 was the first release with the start of the refactoring. During the work for that release it also became clear that an API and ABI break was necessary as well to introduce a Rasqal world object, to enable proper resource tracking – a lesson hard learnt. This was introduced in 0.9.16.
There were plenty of other changes to Rasqal going on outside the query engine model, such as supporting reading and writing result formats, providing result ordering and distinctness, completing the value expression and datatype handling, and general resilience fixes.
The goals of the refactoring were to produce a new query engine that was able to execute a more dynamic query, be broken into understandable components even for complex queries, be testable in small pieces and to continue to execute all the queries that QE1 could do. It should also continue to be a lazy-evaluation model where the user could request a single result and the engine should do the minimum work in order to return it.
Row Sources and SPARQL
The new query engine was designed around a new concept: a row source. This is an active object that on request, would return a row of variable bindings. It generates what corresponds to a row in a SQL result. This active object is the key for implementing the lazy evaluation. At the top level of the query execution, there would be basically one call to top_row_source.getRow() which itself calls inner rowsources’ getRow() in order to execute the query to return the next result.
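A rough sketch of the pull model may help here. It is written in Python rather than Rasqal's C, and the class names and the `get_row()` method are illustrative assumptions, not the library's API. Each rowsource returns one row per request, or `None` when exhausted, and the top rowsource pulls from the ones beneath it.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Row = Dict[str, str]
Triple = Tuple[str, str, str]


class TriplesRowSource:
    """Matches one triple pattern against a graph, one row per request."""

    def __init__(self, pattern: Triple, graph: Iterable[Triple]):
        self.pattern = pattern
        self._triples = iter(graph)  # remembers how far matching has got

    def _match(self, triple: Triple) -> Optional[Row]:
        row: Row = {}
        for p, t in zip(self.pattern, triple):
            if p.startswith("?"):
                if row.setdefault(p, t) != t:
                    return None
            elif p != t:
                return None
        return row

    def get_row(self) -> Optional[Row]:
        for triple in self._triples:
            row = self._match(triple)
            if row is not None:
                return row
        return None  # exhausted


class FilterRowSource:
    """Passes on only the input rows for which the predicate is true."""

    def __init__(self, inner, predicate: Callable[[Row], bool]):
        self.inner = inner
        self.predicate = predicate

    def get_row(self) -> Optional[Row]:
        while True:
            row = self.inner.get_row()
            if row is None or self.predicate(row):
                return row


class ProjectRowSource:
    """Keeps a subset of the input row's variables."""

    def __init__(self, inner, variables: List[str]):
        self.inner = inner
        self.variables = variables

    def get_row(self) -> Optional[Row]:
        row = self.inner.get_row()
        if row is None:
            return None
        return {v: row[v] for v in self.variables if v in row}


# Hypothetical data: find the subjects whose age is at least 18.
graph = [("alice", "age", "30"), ("bob", "age", "8"), ("carol", "age", "41")]
top_row_source = ProjectRowSource(
    FilterRowSource(TriplesRowSource(("?s", "age", "?a"), graph),
                    lambda row: int(row["?a"]) >= 18),
    ["?s"])
first = top_row_source.get_row()  # only as much of the graph as needed is read
```

Because every layer only asks its input for a row when it needs one, requesting a single result touches only the first matching triple.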
Each rowsource would correspond approximately to a SPARQL algebra concept, and since the algebra had a well defined way to turn a query structure into an executable structure, or query plan, the query engine’s main role in preparation of the query was to become a SPARQL query algebra implementation. The algebra concepts were added to Rasqal enabling turning the hierarchical graph pattern structure into algebra concepts and performing the optimization and algebra transformations in the specification. These transformations were tested and validated against the examples in the specification. The resulting tree of “top down” algebra structures were then used to build the “bottom up” rowsource tree.
The rowsource concept also allowed breaking up the complete query engine execution into understandable and testable chunks. The rowsources implemented at this time include:
- Assignment: allowing binding of a new variable from an input rowsource
- Distinct: apply distinctness to an input rowsource
- Empty: returns no rows; used in legitimate queries as well as in transformations
- Filter: evaluates an expression for each row in an input rowsource and passes on those that return True.
- Graph: matches against a graph URI and/or bind a graph variable
- Join: (left-)joins two inner rowsources, used for OPTIONAL.
- Project: projects a subset of input row variables to output row
- Row Sequence: generates a rowsource from a static set of rows
- Sort: sort an input rowsource by a list of order expressions
- Triples: match a triple pattern against a graph and generate a row. This is the fundamental triple pattern or Basic Graph Pattern (BGP) in SPARQL terms.
- Union: return results from the two input rowsources, in order
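To make the Join entry in the list above concrete, here is a minimal left-join rowsource in the same pull style. It is a sketch under simplifying assumptions, not Rasqal's code: the names are hypothetical, the right-hand side is given as a list so it can be rescanned for each left row, and two rows are compatible when their shared variables agree.

```python
from typing import Dict, Iterable, List, Optional

Row = Dict[str, str]


class RowSequence:
    """Rowsource over a static set of rows."""

    def __init__(self, rows: Iterable[Row]):
        self._rows = iter(rows)

    def get_row(self) -> Optional[Row]:
        return next(self._rows, None)


def compatible(a: Row, b: Row) -> bool:
    """Rows are compatible when every shared variable has the same value."""
    return all(b[k] == v for k, v in a.items() if k in b)


class LeftJoinRowSource:
    """Left join, as used for OPTIONAL: unmatched left rows still appear."""

    def __init__(self, left, right_rows: Iterable[Row]):
        self.left = left
        self.right_rows: List[Row] = list(right_rows)
        self._current: Optional[Row] = None  # left row being extended
        self._pos = 0                        # position in the right rows
        self._matched = False                # did the left row match at all?

    def get_row(self) -> Optional[Row]:
        while True:
            if self._current is None:
                self._current = self.left.get_row()
                if self._current is None:
                    return None  # left side exhausted
                self._pos = 0
                self._matched = False
            while self._pos < len(self.right_rows):
                right = self.right_rows[self._pos]
                self._pos += 1
                if compatible(self._current, right):
                    self._matched = True
                    return {**self._current, **right}
            # Right side exhausted for this left row.
            row, matched = self._current, self._matched
            self._current = None
            if not matched:
                return row  # OPTIONAL part missing: emit the left row alone
```

The three fields kept between calls show why the saved state grew beyond a single integer once optionals arrived, while still staying local to one component.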
The QE1 entry point was refactored to look like getRow() and the query engines were tested against each other. In the end QE2 produced identical results, and eventually QE2 was improved such that it passed more DAWG SPARQL tests than QE1.
So in summary QE2 works like this:
- Parse the query string into a hierarchy of graph patterns such as basic, optional, graph, group, union, filter etc. (This is done in rasqal_query_prepare())
- Create a SPARQL algebra expression from the graph pattern tree that describes how to evaluate the query. (This is in rasqal_query_execute() calling QE2 )
- Invert the algebra expression to a hierarchy of rowsources where the top rowsource getRow() call will evaluate the entire query (Ditto)
(If you want to see some of the internals on a real query, run roqet -d debug query.rq from roqet built in maintainer mode and both the query structure and algebra version will be generated.)
The big advantage from a maintenance point of view is that it is divided into small understandable components that can be easily added to.
The result was released in Rasqal 0.9.17 at the end of 2009, 15 months after the previous release. It’s tempting to say nobody noticed the new query engine except that it did more work. There is no way to use the old query engine except by a configure argument when building it. The QE1 code is never called and should be removed from the sources.
Example execution
An example execution is described in the following picture if you follow the numbers in order:
This doesn’t include details of content negotiation, base URIs, result formatting, or the internals of the query execution described above.
SPARQL 1.1
Now it is the end of 2010 and SPARQL 1.1 work is underway to update the original SPARQL Query, which was completed in January 2008. It is a substantial new version that adds greatly to the language. In the SPARQL 1.1 2010-10-14 draft version it adds (these items may or may not be in the final version):
- Assignment with BIND(expr AS ?var)
- Aggregate expressions such as SUM(), COUNT() including grouping and group filtering with HAVING
- Negation between graph patterns using MINUS.
- Property path triple matching.
- Computed select expressions: SELECT ... (expr AS ?var)
- Federated queries using SERVICE to make SPARQL HTTP query requests
- Sub SELECT and BINDINGS to allow queries/results inside queries.
- Updates allowing insertion, deletion and modification of graphs via queries as well as other graph and dataset management
The above is my reading of the major items in the latest draft SPARQL 1.1 query language or its dependent required specifications.
Rasqal next steps
So does SPARQL 1.1 mean Rasqal Query Engine 3? Not yet, although the Rasqal API is still changing too much to call it stable and another API/ABI break is possible. There’s also the question of making an optimizing query engine, a more substantial activity. At this time, I’m not motivated to implement property paths since it seems like a lot of work and there are other pieces I want to do first. Rasqal in GIT handles most of the syntax and is working towards implementing most of the execution of aggregate expressions, sub selects and SERVICE although no dates yet. I work on Rasqal in my spare time when I feel like it, so maybe it won’t be mature with a stable API (which would be a 1.0) until SPARQL 2 rolls by.
The opening conference keynote presentation this year comes from Dion Hinchcliffe, Senior Vice President of Dachis Group. Dion is an internationally recognized business strategist and enterprise architect with an extensive track record of building enterprise solutions and strategies for clients in the Fortune 500, federal government, and Internet start-up community.
In this conversation we explore the impact, challenges, and opportunities of web and social technologies when applied to the enterprise.
There is intense interest currently in Apple’s success with a walled market for apps you install locally on the device. Developers get a route to market, and Apple helps with monetization in return for a substantial share in the money.
The challenge is to extend this to the web at large and make it scale across devices from different vendors. Users shouldn’t have to care about whether the app is locally installed, or downloaded on the fly from the cloud.
Today, many web apps are tied to websites, e.g. Google Docs is tied to the use of Google’s server docs.google.com. End users don’t have a free choice in where apps run, and lack control over where their data resides.
Imagine a market where I can choose an app/service and have it run on my own virtual server. This is akin to taking the idea of a device and expanding it into the cloud. My personal device includes my personal space in the cloud. I buy apps for my personal use and “install them” in this personal space. My personal space can include all of the devices I use, including mobile, desktop, tv and car. I may share my space with others, e.g. my family, friends or colleagues.
This model introduces new players and enriches the ecosystem compared with today’s narrower model, creating broader opportunities for developers. What’s needed to realize this vision?
- Smarter caching and local storage for web pages will blur the distinction between online and locally installed web apps
- Support for monetization, which is likely to necessitate some form of Web Application License Language
I am encouraged by the announcement of Mozilla Open Web Apps, and hope to explore these ideas further as part of an EU funded project called webinos which has only recently started with a view to making it easier to deliver apps across mobile, desktop, tv and cars.
Simon Rogers, editor of The Guardian’s Datablog, last week posted a top 10 list of data.gov.uk datasets ranked by how relevant they could be to people, highlighting a number of very interesting data sets. He featured national transport statistics, a massive data set cataloging not only every bus, rail, coach stop or pier in the UK but every bus, train, tram, or ferry that docked for a week in October; as well as statistics on government spending (COINS), the UK labour market / employment statistics by year, youth perspectives and attitudes by region, and statistics on dog messes by UK region. For each he gives a short synopsis of the contents of the data set, highlights its potential uses, and describes problems/limitations of the data.
This post is of tremendous use not merely because it highlights a tiny, delicious morsel from a rather immense soup of more than 4,223 datasets on data.gov.uk, but because it helps make the data relevant to people: he describes (in easily human-understandable terms) what is in each data set, why the data is relevant or interesting, and most importantly, ways that it can be put to use.
The chasm between publishing and use is currently large and daunting. Many of the data sets are in “raw” form, Excel spreadsheets created by public servants (using highly specialized government vocabulary) or immense, multi-gigabyte CSV files with little supporting documentation. What we see now is a gold rush (on both sides of the Atlantic – data.gov and data.gov.uk) of citizen-hackers who are downloading this data, writing scripts to parse through it, and generating visualisations and apps that make it possible for end-users to actually use it in various ways.
But the Guardian Datablog highlights that this might be an ideal role for journalism to come in as well – while citizen-hackers have been effective at rolling mash-ups that let everyday people get at the data, it still takes a journalist/reporter to get tasty bits out of it — to transform the raw bits into information – speculation, perspective, and to contextualize it in world/current events, and to weave it into a story that leads people to question what the data say about the ways they live each day.
These data journalists, of course, do not have to come from Big Media (TV, newspapers) as such – the ones that do just happen to be best equipped with the right set of skills. In the future, it would be interesting to see whether the many emerging sense-making and visualisation tools, such as ManyEyes, Google Fusion Tables, Freebase Gridworks, and our own work, enAKTing’s GEORDI browser (forthcoming), could make data-journalism more accessible to citizens without a background in statistical data analysis or a journalism degree. If so, these tools could unleash masses of newly equipped citizen-journalists on the terabytes of open data now publicly available, so that it can be more immediately transformed into information that can start to make a difference in people’s lives.
KiWi, the Open Source development platform for building Semantic Social Media Applications, offers features required for Social Media applications such as versioning, (semantic) tagging, rich text editing, easy linking, rating and commenting, as well as advanced “smart” services such as recommendations, rule-based reasoning, information extraction, intelligent search and querying, a sophisticated social reputation system, vocabulary management, and rich visualisation.
To make sure that KiWi does not die after the end of the EC-funded period, the project is making an effort to form a community. The release party was thus also an opportunity to get in touch with the project team. Another opportunity to get in touch with the software and the developers behind it comes in February next year, when the KiWi Snow Camp will take place somewhere in the Salzburg mountains.
- which have a good idea on how semantic technologies can make social media hit the target?
- and are inspired by the possibilities of the KiWi platform?
Together with the KiWi Team, participants will meet in February 2011 in Salzburg’s mountains to develop ideas, program, discuss and develop amazing new pieces of code – and of course enjoy the skiing experience. Not to mention receive the glory of recognition from others in the open source communities and within the broader semantic web community.
How do I get my trip to the KiWi Snow Camp?
You will need to register as a participant for the KiWi Developer Challenge. Please email firstname.lastname@example.org to register your intention to participate in the Challenge; if you are not already registered on the KiWi Community site, please do so and include a brief biography.
Just a quick note to mention that the Linking Enterprise Data book is now available online. Along with Tom Scott, Silver Oliver, Patrick Sinclair and Michael Smethurst, we wrote a chapter on the use of Semantic Web technologies at the BBC, which expands on the W3C Case Study we wrote at the beginning of the year. If you're interested in how Semantic Web technologies were used to build BBC Programmes, BBC Music and Wildlife Finder, make sure you read it (I also noticed it was available for pre-order on Amazon).
I have just released Redland librdf library version 1.0.11 which has been in progress for some time, delayed by the large amount of work to get out Raptor V2 as well as initial SPARQL 1.1 draft work for Rasqal 0.9.20.
The main features in this release are as follows:
- Virtuoso storage backend querying now fully works.
- Several new convenience APIs were added and others deprecated.
- Support building with Raptor V2 API if configured with --with-raptor2.
- Exports more functions to SWIG language bindings.
- Switched to GIT version control hosted by GitHub.
- Fixed Issues: #0000124, #0000284, #0000321, #0000322, #0000334, #0000338, #0000341, #0000344, #0000350, #0000363, #0000366, #0000371, #0000380, #0000382 and #0000383
See the Redland librdf 1.0.11 Release Notes for the full details of the changes.
Note that the Redland language bindings 184.108.40.206 works fine with Redland librdf 1.0.11 but the bindings will soon have a release to match.
I had the opportunity the other day to converse about the semantic technology business proposition in terms of business development. My interlocutor was a business development consultant who had little prior knowledge of this technology but a background in business development inside a large diversified enterprise.
I will here recap some of the points discussed, since these can be of broader interest.
Why is there no single dominant vendor?
The field is young. We can take the relational database industry as a historical precedent. From the inception of the relational database around 1970, it took 15 years for the relational model to become mainstream. "Mainstream" here does not mean dominant in installed base, but does mean something that one tends to include as a component in new systems. The figure of 15 years might repeat with RDF, from around 2000 for the first beginnings to 2015 for routine inclusion in new systems, where applicable.
This does not necessarily mean that the RDF graph data model (or more properly, EAV+CR; Entity-Attribute-Value + Classes and Relationships) will take the place of the RDBMS as the preferred data backbone. This could mean that RDF model serialization formats will be supported as data exchange mechanisms, and that systems will integrate data extracted by semantic technology from unstructured sources. Some degree of EAV storage is likely to be common, but on-line transactional data is guaranteed to stay pure relational, as EAV is suboptimal for OLTP. Analytics will see EAV alongside relational especially in applications where in-house data is being combined with large numbers of outside structured sources or with other open sources such as information extracted from the web.
EAV offerings will become integrated by major DBMS vendors, as is already the case with Oracle. Specialized vendors will exist alongside these, just as is the case with relational databases.
Can there be a positive reinforcement cycle (e.g., building cars creates a need for road construction, and better roads drive demand for more cars)? Or is this an up-front infrastructure investment that governments make for some future payoff or because of science-funding policies?
The Document Web did not start as a government infrastructure initiative. The infrastructure was already built, albeit first originating with the US defense establishment. The Internet became ubiquitous through the adoption of the Web. The general public's adoption of the Web was bootstrapped by all major business and media adopting the Web. They did not adopt the web because they particularly liked it, as it was essentially a threat to the position of media and to the market dominance of big players who could afford massive advertising in this same media. Adopting the web became necessary because of the prohibitive opportunity cost of not adopting it.
A similar process may take place with open data. For example, in E-commerce, vendors do not necessarily welcome easy-and-automatic machine-based comparison of their offerings against those of their competitors. Publishing data will however be necessary in order to be listed at all. Also, in social networks, we have the identity portability movement which strives to open the big social network silos. Data exchange via RDF serializations, as already supported in many places, is the natural enabling technology for this.
Will the web of structured data parallel the development of web 2.0?
Web 2.0 was about the blogosphere, exposure of web site service APIs, creation of affiliate programs, and so forth. If the Document Web was like a universal printing press, where anybody could publish at will, Web 2.0 was a newspaper, bringing the democratization of journalism, creating the blogger, the citizen journalist. The Data Web will create the Citizen Analyst, the Mini Media Mogul (e.g., social-network-driven coops comprised of citizen journalists, analysts, and other content providers such as video and audio producers and publishers). As the blogosphere became an alternative news source to the big media, the web of data may create an ecosystem of alternative data products. Analytics is no longer a government or big business only proposition.
Is there a specifically semantic market or business model, or will semantic technology be exploited under established business models and merged as a component technology into existing offerings?
We have seen a migration from capital expenses to operating expenses in the IT sector in general, as exemplified by cloud computing's Platform as a Service (PaaS) and Software as a Service (SaaS). It is reasonable to anticipate that this trend will continue to Data as a Service (DaaS). Microsoft OData and Dallas are early examples of this and go towards legitimizing the data as a service concept. DaaS is not related to semantic technology per se, but since this will involve integration of data, RDF serializations will be attractive, especially given the takeoff of linked data in general. The data models in OData are also much like RDF, as both stem from EAV+CR, which makes for easy translation and a degree of inherent interoperability.
The integration of semantic technology into existing web properties and business applications will manifest to the end user as increased serendipity. The systems will be able to provide more relevant and better contextualized data for the user's situation. This applies equally to the consumer and business user cases.
Identity virtualization in the forms of WebID and Webfinger — making first-class de-referenceable identifiers of mailto: and acct: schemes — is emerging as a new way to open social network and Web 2.0 data silos.
On the software production side, especially as concerns data integration, the increased schema- and inference-flexibility of EAV will lead to a quicker time to answer in many situations. The more complex the task or the more diverse the data, the higher the potential payoff. Data in cyberspace is mirroring the complexity and diversity of the real world, where heterogeneity and disparity are simply facts of life, and such flexibility is becoming an inescapable necessity.
The RDFa Working Group has published a new draft for the RDFa API. The API is specified in WebIDL, and is primarily aimed at ECMAScript applications that want to include structured information management into their Web Application. The design of the system is modular and allows multiple pluggable extraction and storage mechanisms supporting not only RDFa, but also Microformats, Microdata, and other structured data formats.
This is the second public Working Draft of the document. The current document contains a summary of changes and also a number of open issues. Feedback on those from the community would be very welcome. Please, send your comments to email@example.com (subscribe, archives).
I will begin by extending my thanks to the organizers, in particular Reto Krummenacher of STI and Atanas Kiryakov of Ontotext, for inviting me to give a position paper at the workshop. Indeed, it is the builders of bridges, the pontiffs (pontifex) amongst us, who shall be remembered by history. The idea of organizing a semantic data management workshop at VLDB is a laudable attempt at rapprochement between two communities to the advantage of all concerned.
I talked about making RDF cost competitive with relational for data integration and BI. The crux is space efficiency and column store techniques.
One question that came up was that maybe RDF could approach relational in some things, but what about string literals being stored in a separate table? Or URI strings being stored in a separate table?
The answer is that if one accesses a lot of these literals the access will be local and fairly efficient. If one accesses just a few, it does not matter. For user-facing reports, there is no point in returning a million strings that the user will not read anyhow. But then it turned out that there in fact exist reports in bioinformatics where there are 100,000 strings. Now take the worst abuse of SPARQL: a regexp over all literals in a property of a given class. With a column store this is a scan of the column; with RDF, a three table join. The join is about 10x slower than the column scan. This is quite OK, considering that a full text index is the likely solution for such workloads anyway. Besides, a sensible relational schema will also not use strings for foreign keys, and will therefore incur a similar burden from fetching the strings before returning the result.
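The difference between the two access paths can be sketched as follows. The layout is a toy assumption for illustration (integer IDs, one triple table, one literal table) and is not the schema of Virtuoso or any other store; the sketch shows the shape of the work, not the 10x figure.

```python
import re

# Column-store layout: the property is a single column of strings.
person_name = ["Alice Smith", "Bob Jones", "Carol Smith"]


def column_scan(column, pattern):
    """Regexp over one column: a single sequential scan."""
    rx = re.compile(pattern)
    return [value for value in column if rx.search(value)]


# RDF layout (hypothetical IDs): triples hold IDs, strings live elsewhere.
RDF_TYPE, PERSON, NAME = 1, 2, 3
triples = [
    (10, RDF_TYPE, PERSON), (10, NAME, 100),
    (11, RDF_TYPE, PERSON), (11, NAME, 101),
    (12, RDF_TYPE, PERSON), (12, NAME, 102),
    (13, NAME, 103),  # has a name but is not a Person
]
literals = {100: "Alice Smith", 101: "Bob Jones",
            102: "Carol Smith", 103: "Smith & Co"}


def rdf_regex(triples, literals, cls, prop, pattern):
    """Regexp over all literals of a property of a given class.

    Three lookups are joined: the class-membership triples, the
    property triples, and the literal table holding the strings.
    """
    rx = re.compile(pattern)
    members = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    matches = []
    for s, p, o in triples:
        if p == prop and s in members:
            text = literals[o]  # fetch the string for the literal ID
            if rx.search(text):
                matches.append(text)
    return matches
```

Both functions return the same strings for the same data; the RDF version simply has to go through two extra lookups per candidate value.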
Another question was about whether the attitude was one of confrontation between RDF and relational and whether it would not be better to join forces. Well, as said in my talk, sauce for the goose is sauce for the gander and generally speaking relational techniques apply equally to RDF. There are a few RDB tricks that have no RDF equivalent, like clustering a fact table on dimension values, e.g., sales ordered by country, manufacturer, month. But by and large, column-store techniques apply. The execution engine can be essentially identical, just needing a couple of extra data types and some run-time typing and in some cases producing nulls instead of errors. Query optimization is much the same, except that RDB stats are not applicable as such; one needs to sample the data in the cost model. All in all, these adaptations to a RDB are not so large, even though they do require changes to source code.
Another question was about combining data models, e.g., relational (rows and columns), RDF (graph), XML (tree), and full text. Here I would say that it is a fault of our messaging that we do not constantly repeat the necessity of this combining, as we take it for granted. Most RDF stores have a full text index on literal values. OWLIM and a CWI prototype even have it for URIs. XML is a valid data type for an RDF literal, even though this does not get used very much. So doing SPARQL to select the values, and then doing XPath and XSLT on the values, is entirely possible, at least in Virtuoso, which has an XPath/XSLT engine built in. The same goes for invoking SPARQL from an XSLT sheet. Colocating a native RDBMS with local and federated SQL is what Virtuoso has always done. One can, for example, map tables in heterogeneous remote RDBs into tables in Virtuoso, then map these into RDF, and run SPARQL queries that get translated into SQL against the original tables, thereby getting SPARQL access without any materialization. Alongside this, one can ETL relational data into RDF via the same declarative mapping.
Further, there are RDF extensions for geospatial queries in Virtuoso and AllegroGraph, and soon also in others.
With all this cross-model operation, RDF is definitely not a closed island. We'll have to repeat this more.
Let us talk about SpiderStore first.
SpiderStore
The SpiderStore from the University of Innsbruck is a main-memory-only system that has a record for each distinct IRI. The IRI record has one array of pointers to all IRI records that are objects where the referencing record is the subject, and a similar array of pointers to all records where the referencing record is the object. Both sets of pointers are clustered based on the predicate labeling the edge.
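To make the layout concrete, here is a minimal sketch of such a record structure. It is not SpiderStore's actual code: the class names `Node` and `Graph`, the `ex:` IRIs, and the use of Python references in place of raw pointers are all assumptions made for illustration.

```python
from collections import defaultdict


class Node:
    """One record per distinct IRI (hypothetical layout, for illustration only)."""

    def __init__(self, iri):
        self.iri = iri
        # predicate -> list of Node references; grouping by predicate mirrors
        # the clustering of the pointer arrays described in the text.
        self.outgoing = defaultdict(list)  # edges where this node is the subject
        self.incoming = defaultdict(list)  # edges where this node is the object


class Graph:
    def __init__(self):
        self.nodes = {}

    def node(self, iri):
        # Create the record on first use, so each distinct IRI has exactly one.
        if iri not in self.nodes:
            self.nodes[iri] = Node(iri)
        return self.nodes[iri]

    def add(self, s, p, o):
        subj, obj = self.node(s), self.node(o)
        subj.outgoing[p].append(obj)
        obj.incoming[p].append(subj)


g = Graph()
g.add("ex:alice", "ex:knows", "ex:bob")
g.add("ex:bob", "ex:knows", "ex:carol")
g.add("ex:alice", "ex:worksAt", "ex:deri")

# A two-hop traversal follows references only; no index lookup per edge.
friends_of_friends = [
    fof.iri
    for friend in g.node("ex:alice").outgoing["ex:knows"]
    for fof in friend.outgoing["ex:knows"]
]
```

Following an edge here is a single reference dereference, which is the property the design is built around.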
According to the authors (Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, and Günther Specht), a distinct IRI is 5 pointers and each triple is 3 pointers. This would make about 4 pointers per triple, i.e., 32 bytes with 64-bit pointers.
This is not particularly memory efficient, since one must count unused space after growing the lists, fragmentation, etc., which will make the space consumption closer to 40 bytes per triple. Should one add a graph to the mix, one would need another pointer per distinct predicate, adding another 1-4 bytes per triple. Supporting non-IRI types in the object position is not a problem, as long as all distinct values have a chunk of memory to them with a type tag.
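The 32-byte figure can be reproduced with a short calculation. The ratio of five triples per distinct IRI is our assumption, chosen because it is what makes the authors' numbers come out at about 4 pointers per triple; the function name `bytes_per_triple` is likewise only illustrative.

```python
POINTER_BYTES = 8  # 64-bit pointers, as in the text


def bytes_per_triple(triples, distinct_iris, pointers_per_iri=5, pointers_per_triple=3):
    """Average bytes per triple, before list slack and fragmentation."""
    total_pointers = pointers_per_iri * distinct_iris + pointers_per_triple * triples
    return POINTER_BYTES * total_pointers / triples


# Assumed ratio: one distinct IRI for every five triples.
estimate = bytes_per_triple(triples=1_000_000, distinct_iris=200_000)
```

With a higher share of distinct IRIs the figure grows quickly, since each one costs 5 pointers regardless of how many triples mention it.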
We get a few times better memory efficiency with column compressed quads, plus we are not limited to main memory.
But SpiderStore has a point. Making the traversal of an edge in the graph into a pointer dereference is not such a bad deal, especially if the data set is not that big. Furthermore, compiling the queries into C procedures playing with the pointers alone would give performance to match or exceed any hard coded graph traversal library and would not be very difficult. Supporting multithreaded updates would spoil much of the gain but allowing single threaded updates and forking read-only copies for reading would be fine.
SpiderStore as such is not attractive for what we intend to do, this being aggregating RDF quads in volumes far exceeding main memory and scaling to clusters. We note that SpiderStore hits problems with distributed memory, since SpiderStore executes depth first, which is manifestly impossible if significant latencies are involved. In other words, if there can be latency, one must amortize by having a lot of other possible work available. Running with long vectors of values is one way, as in MonetDB or Virtuoso Cluster. The other way is to have a massively multithreaded platform which favors code with few instructions but little memory locality. SpiderStore could be a good fit for massive multithreading, especially if queries were compiled to C, dramatically cutting down on the count of instructions to execute.
We too could adopt some ideas from SpiderStore. Namely, if running vectored, one just in passing, without extra overhead, generates an array of links to the next IRI, a bit like the array that SpiderStore has for each predicate for the incoming and outgoing edges of a given IRI. Of course, here these would be persistent IDs and not pointers, but a hash from one to the other takes almost no time. So, while SpiderStore alone may not be what we are after for data warehousing, Spiderizing parts of the working set would not be so bad. This is especially so since the Spiderizable data structure almost gets made as a by-product of query evaluation.
If an algorithm made several passes over a relatively small subgraph of the whole database, Spiderizing it would accelerate things. The memory overhead could have a fixed cap so as not to ruin the working set if locality happened not to hold.
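A rough sketch of what such capped Spiderizing might look like follows. This is an illustration of the idea only, not Virtuoso code: the class `SpiderCache`, the `lookup` callback standing in for the store's index access, the toy `EDGES` data, and the least-recently-used eviction policy are all assumptions.

```python
from collections import OrderedDict


class SpiderCache:
    """Bounded cache of adjacency arrays keyed by (node id, predicate id)."""

    def __init__(self, lookup, max_links):
        self.lookup = lookup        # falls back to the store on a miss
        self.max_links = max_links  # fixed cap on the number of cached links
        self.links = OrderedDict()
        self.size = 0

    def neighbors(self, node_id, predicate_id):
        key = (node_id, predicate_id)
        if key in self.links:
            self.links.move_to_end(key)  # mark as recently used
            return self.links[key]
        # The array is kept as a by-product of the regular lookup.
        result = tuple(self.lookup(node_id, predicate_id))
        self.links[key] = result
        self.size += len(result)
        # Evict the least recently used arrays until back under the cap.
        while self.size > self.max_links and len(self.links) > 1:
            _, evicted = self.links.popitem(last=False)
            self.size -= len(evicted)
        return result


# Toy store: persistent IDs, not pointers.
EDGES = {(1, 10): [2, 3], (2, 10): [4], (3, 10): [4, 5]}
calls = []


def store_lookup(node_id, predicate_id):
    calls.append((node_id, predicate_id))
    return EDGES.get((node_id, predicate_id), [])


cache = SpiderCache(store_lookup, max_links=4)
first = cache.neighbors(1, 10)
second = cache.neighbors(1, 10)  # served from the cache, no store access
```

A repeated pass over the same subgraph then touches the store only once per edge list, while the cap keeps a query with poor locality from displacing the rest of the working set.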
Running a SpiderStore-like execution model on vectors instead of single values would likely do no harm and might even result in better cache behavior. The exception is in the event of completely unpredictable patterns of connections which may only be amortized by massive multithreading.
Webpie
Webpie from VU Amsterdam and the LarKC EU FP7 project is, as it were, the opposite of SpiderStore. This is a map-reduce-based RDFS and OWL Horst inference engine which is all about breadth-first passes over the data in a map-reduce framework with intermediate disk-based storage.
Webpie is not, however, a database. After the inference result has been materialized, it must be loaded into a SPARQL engine in order to evaluate a query against the result.
The execution plan of Webpie is made from the ontology whose consequences must be materialized. The steps are sorted and run until a fixed point is reached for each. This is similar to running SPARQL INSERT … SELECT statements until no new inserts are produced. The only requirement is that the INSERT statement should report whether new inserts were actually made. This is easy to do. In this way, a comparison between map-reduce plus memory-based joining and a parallel RDF database could be made.
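The fixed-point loop can be sketched in a few lines. This is a toy in-memory stand-in, not Webpie and not a SPARQL engine: `insert_select` plays the role of an INSERT … SELECT that reports how many rows were new, the two rules are the RDFS subclass rules, and all function names and `ex:` IRIs are ours.

```python
def subclass_transitivity(triples):
    """(a subClassOf b), (b subClassOf c) => (a subClassOf c)."""
    sub = [(s, o) for s, p, o in triples if p == "rdfs:subClassOf"]
    return {(a, "rdfs:subClassOf", c) for a, b in sub for b2, c in sub if b == b2}


def type_inheritance(triples):
    """(x type a), (a subClassOf b) => (x type b)."""
    sub = [(s, o) for s, p, o in triples if p == "rdfs:subClassOf"]
    typ = [(s, o) for s, p, o in triples if p == "rdf:type"]
    return {(x, "rdf:type", b) for x, a in typ for a2, b in sub if a == a2}


def insert_select(store, rule):
    """Stand-in for INSERT ... SELECT: add derived triples, report how many were new."""
    new = rule(store) - store
    store |= new
    return len(new)


def materialize(store, rules):
    # Run each step, in the given order, until it produces no new inserts.
    for rule in rules:
        while insert_select(store, rule) > 0:
            pass
    return store


store = {
    ("ex:Camera", "rdfs:subClassOf", "ex:Device"),
    ("ex:Device", "rdfs:subClassOf", "ex:Product"),
    ("ex:p1", "rdf:type", "ex:Camera"),
}
materialize(store, [subclass_transitivity, type_inheritance])
```

The ordering matters: closing the subclass hierarchy first means the type rule needs only one productive pass, which is the point of sorting the steps.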
We have suggested such an experiment to the LarKC people. We will see.
Most queries are logarithmic to scale factor, but some are linear. The linear ones come to dominate the metric at larger scales.
An update stream would make the workload more realistic.
We could rectify this all with almost no changes to the data generator or test driver by adding one or two more metrics.
So I am publishing the below as a starting point for discussion.
BSBM Analytics Mix
Below is a set of business questions that can be answered with the BSBM data set. These are more complex and touch a greater percentage of the data than the initial mix. Their evaluation is between linear and n * log(n) to the data size. The TPC-H rules can be used for a power metric (single user) and a throughput metric (multi-user, where each user submits queries from the mix with different parameters and in a different order). The TPC-H score formula and executive summary formats are directly applicable.
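For orientation, the shape of the TPC-H style scores is roughly as follows. This is a sketch from our reading of the TPC-H formulas, simplified by leaving out the refresh functions and by making the number of queries per stream a parameter, since this mix does not have TPC-H's 22 queries. The function names and sample timings are invented.

```python
import math


def power_metric(query_seconds, scale_factor):
    """3600 * SF divided by the geometric mean of the single-user query times."""
    log_mean = sum(math.log(t) for t in query_seconds) / len(query_seconds)
    return 3600.0 * scale_factor / math.exp(log_mean)


def throughput_metric(streams, queries_per_stream, elapsed_seconds, scale_factor):
    """Queries completed per hour across all concurrent streams, scaled by SF."""
    return streams * queries_per_stream * 3600.0 / elapsed_seconds * scale_factor


def composite_metric(power, throughput):
    """Geometric mean of the power and throughput scores."""
    return math.sqrt(power * throughput)


# Invented timings: a two-query mix, run single-user and then with two streams.
power = power_metric([2.0, 8.0], scale_factor=1)
throughput = throughput_metric(streams=2, queries_per_stream=2,
                               elapsed_seconds=16.0, scale_factor=1)
composite = composite_metric(power, throughput)
```

Because the power score uses a geometric mean, one slow query cannot dominate it the way it would dominate a sum of run times.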
This can be a separate metric from the "restricted" BSBM score. Restricted means "without a full scan with regexp" which will dominate the whole metric at larger scales.
Vendor specific variations in syntax will occur, hence these are allowed but disclosure of specific query text should accompany results. Hints for JOIN order and the like are not allowed; queries must be declarative. We note that both SPARQL and SQL implementations of the queries are possible.
The queries are ordered so that the first ones fill the cache. Running the analytics mix immediately after backup post initial load is allowed, resulting in semi-warm cache. Steady-state rules will be defined later, seeing the characteristics of the actual workload.
For each country, list the top 10 product categories, ordered by the count of reviews from the country.
Product with the most reviews during its first month on the market
10 products most similar to X, with similarity score based on the count of features in common
Top 10 reviewers of category X
Product with largest increase in reviews in month X compared to month X-minus-1.
Product of category X with largest change in mean price in the last month
Most active American reviewer of Japanese cameras last year
Correlation of price and average review
Features with greatest impact on price — for features occurring in category X, find the top 10 features where the mean price with the feature is most above the mean price without the feature
Country with greatest popularity of products in category X — reviews of category X from country Y divided by total reviews
Leading product of category X by country, mentioning mean price in each country and number of offers, sort by number of offers
Fans of manufacturer — find top reviewers who score manufacturer above their mean score
Products sold only in country X
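As an illustration of the intended semantics of one of these questions, the similarity query (10 products most similar to X) could be read as follows. This is a plain in-memory sketch, not a SPARQL or SQL implementation; the function name, the product and feature identifiers, and the tie-breaking rule are assumptions.

```python
def most_similar(product_features, target, limit=10):
    """Rank other products by the number of features shared with `target`."""
    wanted = product_features[target]
    scored = []
    for product, features in product_features.items():
        if product == target:
            continue
        shared = len(wanted & features)
        if shared:  # products with nothing in common are not similar at all
            scored.append((shared, product))
    # Highest score first; ties broken by product name for a stable result.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(product, score) for score, product in scored[:limit]]


catalog = {
    "p1": {"f1", "f2", "f3"},
    "p2": {"f1", "f2"},
    "p3": {"f3"},
    "p4": {"f9"},
}
ranking = most_similar(catalog, "p1")
```

In a database this is a self-join on the product-feature relation followed by a group-by and top-k, which is why it touches a large share of the data.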
Since RDF stores often implement a full text index, and since a full scan with regexp matching would never be used in an online E-commerce portal, it is meaningful to extend the benchmark to have some full text queries.
For the SPARQL implementation, text indexing should be enabled for all string-valued literals even though only some of them will be queried in the workload.
Q6 from the original mix, now allowing use of text index.
Reviews of products of category X where the review contains the names of 1 to 3 product features that occur in said category of products; e.g., MP3 players with support for mp4 and ogg.
Same as the previous query, but now specifying the review author. The intent is that structured criteria are here more selective than text.
Difference in the frequency of use of "awesome", "super", and "suck(s)" by American vs. European vs. Asian review authors.
For full text queries, the search terms have to be selected according to a realistic distribution. DERI has offered to provide a definition and possibly an implementation for this.
The parameter distribution for the analytics queries will be defined when developing the queries; the intent is that one run will touch 90% of the values in the properties mentioned in the queries.
The result report will have to be adapted to provide a TPC-H executive summary-style report and appropriate metrics.
Changes to Data Generation
For supporting the IR mix, reviews should, in addition to random text, contain the following:
For each feature in the product concerned, add the label of said feature to 60% of the reviews.
Add the names of review author, product, product category, and manufacturer.
The review score should be expressed in the text by adjectives (e.g., awesome, super, good, dismal, bad, sucky). Every 20th word can be an adjective from the list correlating with the score in 80% of uses of the word and random in 20%. For 90% of adjectives, pick the adjectives from lists of idiomatic expressions corresponding to the country of the reviewer. In 10% of cases, use a random list of idioms.
Skew the review scores so that comparatively expensive products have a smaller chance for a bad review.
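The adjective rule above could be implemented along these lines. This is only a sketch to show how the percentages interact: the idiom lists, the country codes, the 1-10 score scale with anything above 5 counting as positive, and the choice to replace every 20th filler word are all assumptions, not part of the proposal.

```python
import random

# Hypothetical idiom lists keyed by country and sentiment; real lists would
# be supplied with the data generator.
IDIOMS = {
    "US": {"good": ["awesome", "super"], "bad": ["sucky", "dismal"]},
    "GB": {"good": ["brilliant", "smashing"], "bad": ["rubbish", "dire"]},
}


def pick_adjective(score, country, rng):
    # 90% of the time use the reviewer's own country list, else a random one.
    source = country if rng.random() < 0.9 else rng.choice(sorted(IDIOMS))
    # 80% of the time the sentiment follows the score, else it is random.
    if rng.random() < 0.8:
        sentiment = "good" if score > 5 else "bad"  # assumed 1-10 scale
    else:
        sentiment = rng.choice(["good", "bad"])
    return rng.choice(IDIOMS[source][sentiment])


def review_text(filler_words, score, country, rng):
    words = []
    for position, word in enumerate(filler_words, start=1):
        # Every 20th word becomes a score-related adjective.
        words.append(pick_adjective(score, country, rng) if position % 20 == 0 else word)
    return " ".join(words)


rng = random.Random(42)  # fixed seed so that a run is repeatable
sample = review_text(["lorem"] * 40, score=9, country="US", rng=rng)
```

Passing the random generator in explicitly keeps the generated data set reproducible, which a benchmark needs.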
During the benchmark run:
1% of products are added;
3% of initial offers are deleted and 3% are added; and
5% of reviews are added.
Updates may be divided into transactions and run in series or in parallel in a manner specified by the test sponsor. The code for loading the update stream is vendor specific but must be disclosed.
The initial bulk load does not have to be transactional in any way.
Loading the update stream must be transactional, guaranteeing that all information pertaining to a product or an offer constitutes a transaction. Multiple offers or products may be combined in a transaction. Queries should run at least in READ COMMITTED isolation, so that half-inserted products or offers are not seen.
Full text indices do not have to be updated transactionally; the update can lag up to 2 minutes behind the insertion of the literal being indexed.
The test data generator generates the update stream together with the initial data. The update stream is a set of files containing Turtle-serialized data for the updates, with all triples belonging to a transaction in consecutive order. The possible transaction boundaries are marked with a comment distinguishable from the text. The test sponsor may implement a special load program if desired. The files must be loaded in sequence but a single file may be loaded on any number of parallel threads.
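A special load program would first need to cut a file into its possible transactions. A minimal sketch follows; the marker text in `BOUNDARY` is made up, since the actual comment format is yet to be defined, and the example IRIs are placeholders.

```python
BOUNDARY = "# --- transaction boundary ---"  # hypothetical marker text


def split_transactions(lines):
    """Yield lists of Turtle lines, one list per possible transaction."""
    current = []
    for line in lines:
        if line.strip() == BOUNDARY:
            if current:
                yield current
                current = []
        elif line.strip():  # skip blank lines
            current.append(line.rstrip("\n"))
    if current:  # last unit needs no trailing marker
        yield current


stream = """\
<http://example.org/product/1> <http://example.org/label> "Camera A" .
<http://example.org/product/1> <http://example.org/price> "199" .
# --- transaction boundary ---
<http://example.org/offer/7> <http://example.org/product> <http://example.org/product/1> .
"""
transactions = list(split_transactions(stream.splitlines()))
```

Since the boundaries are only possible commit points, a loader is free to combine several consecutive units into one transaction or to hand them to parallel threads.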
The data generator should generate multiple files for the initial dump in order to facilitate parallel loading.
The same update stream can be used during all tests, starting each run from a backup containing only the initial state. In the original run, the update stream is applied starting at the measurement interval, after the SUT is in steady state.