Markus Stocker

Having discussed the RDF (meta-) data model, the RDFS and OWL languages used to describe what RDF data mean, and SSN as a vocabulary for describing sensor observations and metadata about sensing devices in RDF, this post focuses on the RDF Data Cube Vocabulary (QB vocabulary) for describing dataset observations and metadata about datasets in RDF.

Systems that collect data from the devices of an environmental sensor network typically do not just collect sensor observations; they also process them. Processed data is arguably no longer sensor data. Rather, it is data that results from a computational process, one that transforms input data into output data. A system that collects sensor data, processes data, and builds on RDF, RDFS and OWL (and perhaps uses the SSN vocabulary) may use the QB vocabulary to represent processed data.

Let’s look at an example.

Data

Sticking with the net ecosystem exchange (NEE) measurements also used in previous posts, I suggest the following small data sample.

Time period        GPP       TER       NEE
2012-03-24T15:00   0.53997   1.00844   0.46847
2012-03-24T15:30   0.57284   1.08207   0.50923

NEE values are shown partitioned into gross primary productivity (GPP) and total ecosystem respiration (TER). In other words, TER - GPP = NEE; for the first row we have 1.00844 - 0.53997 = 0.46847.
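The relation TER - GPP = NEE can be checked for both rows with a few lines of Python. This is only a sanity check on the sample values; the function name nee_from_partition is made up for illustration and says nothing about how the values were actually computed.

```python
import math

# (time period, GPP, TER, NEE) copied from the two data rows of the sample table.
rows = [
    ("2012-03-24T15:00", 0.53997, 1.00844, 0.46847),
    ("2012-03-24T15:30", 0.57284, 1.08207, 0.50923),
]


def nee_from_partition(gpp: float, ter: float) -> float:
    """Return NEE as TER minus GPP, the relation stated in the text."""
    return ter - gpp


for time_period, gpp, ter, nee in rows:
    computed = nee_from_partition(gpp, ter)
    # Compare with a tolerance because the values are binary floats.
    assert math.isclose(computed, nee, abs_tol=1e-9), time_period
    print(time_period, f"{computed:.5f}")
```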

The sensor observations discussed in the previous post serve in the rather involved computation of these values, and Shurpali et al. [1] provide more details. For this post it is sufficient to remember that those sensor observations are processed, and that the results of this processing are dataset observations.

The sample data consists of a header and two data rows, each of which is a dataset observation. The unit is μmol m⁻² s⁻¹. A notable difference between the sensor observations in the previous post and the dataset observations in this post is the time period. Sensor observations were sampled at 10 Hz whereas here we have one dataset observation every 30 minutes. Another difference is that a sensor observation was for mole fraction of either CO2 or H2O, and each sensor observation thus displayed a single observation value. In contrast, the dataset observations here have three values, for GPP, TER, and NEE. Having multiple columns is typical for datasets; think of Excel sheets or comma-separated values (CSV) plain-text files.

Observations

I will now describe how the data shown in the table above can be represented as dataset observations, following the QB vocabulary.

First, let’s consider the header of the table. The QB vocabulary provides for the specification of data structure definitions. The table header can be translated into a data structure definition for our sample data. A data structure definition specifies one or more components. Each of the columns in our table is mapped to a component. The specification of a component includes details about the component property, e.g. timePeriod or grossPrimaryProductivity, as well as metadata about whether or not the component is required in the dataset or the position (order) of the component in the data structure. QB supports three types of component properties, namely dimension, measure, and attribute. Given the header of our table, we can thus formulate a data structure definition as follows:

dsd1 rdf:type qb:DataStructureDefinition
dsd1 qb:component cs1
cs1 rdf:type qb:ComponentSpecification
cs1 qb:dimension timePeriod
timePeriod rdf:type qb:ComponentProperty
cs1 qb:componentRequired "true"^^xsd:boolean
cs1 qb:order "1"^^xsd:int
dsd1 qb:component cs2
cs2 rdf:type qb:ComponentSpecification
cs2 qb:measure grossPrimaryProductivity
grossPrimaryProductivity rdf:type qb:ComponentProperty
cs2 qb:componentRequired "true"^^xsd:boolean
cs2 qb:order "2"^^xsd:int
dsd1 qb:component cs3
cs3 rdf:type qb:ComponentSpecification
cs3 qb:measure totalEcosystemRespiration
totalEcosystemRespiration rdf:type qb:ComponentProperty
cs3 qb:componentRequired "true"^^xsd:boolean
cs3 qb:order "3"^^xsd:int
dsd1 qb:component cs4
cs4 rdf:type qb:ComponentSpecification
cs4 qb:measure netEcosystemExchange
netEcosystemExchange rdf:type qb:ComponentProperty
cs4 qb:componentRequired "true"^^xsd:boolean
cs4 qb:order "4"^^xsd:int

Akin to the table header, our data structure definition dsd1 specifies four components, cs1 to cs4, one for each table column. Each component specification relates to a component property, e.g. cs2 relates to the grossPrimaryProductivity component property. The data structure definition also specifies that all four components are required and that their order is the same as in our table. We thus have a machine readable and interpretable specification for the structure of our dataset.
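Because the data structure definition follows the table header so closely, it could just as well be generated from the header. The following Python sketch does that with plain tuples standing in for RDF triples. The helper build_dsd and the tuple layout are assumptions made for illustration, not part of the QB vocabulary or of any RDF library.

```python
# Table columns paired with the kind of QB component property each one maps to.
COLUMNS = [
    ("timePeriod", "qb:dimension"),
    ("grossPrimaryProductivity", "qb:measure"),
    ("totalEcosystemRespiration", "qb:measure"),
    ("netEcosystemExchange", "qb:measure"),
]


def build_dsd(dsd, columns):
    """Return (subject, predicate, object) tuples describing the structure.

    Hypothetical helper: component specifications are named cs1, cs2, ...
    in column order, mirroring the listing in this post.
    """
    triples = [(dsd, "rdf:type", "qb:DataStructureDefinition")]
    for order, (prop, kind) in enumerate(columns, start=1):
        cs = f"cs{order}"
        triples += [
            (dsd, "qb:component", cs),
            (cs, "rdf:type", "qb:ComponentSpecification"),
            (cs, kind, prop),
            (prop, "rdf:type", "qb:ComponentProperty"),
            (cs, "qb:componentRequired", '"true"^^xsd:boolean'),
            (cs, "qb:order", f'"{order}"^^xsd:int'),
        ]
    return triples


dsd_triples = build_dsd("dsd1", COLUMNS)
for subject, predicate, obj in dsd_triples:
    print(subject, predicate, obj)
```

Each column contributes six statements, so the four columns plus the type statement for dsd1 give 25 statements in total.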

Dataset observations relate to datasets. In other words, they are elements of datasets. In our example, the table is the dataset and each row is an element of the table. The dataset then relates to the data structure definition. In practice:

d1 rdf:type qb:DataSet
d1 qb:structure dsd1

We now have a dataset d1 of structure dsd1.

How about the rows of the table? They are dataset observations, elements of dataset d1, with component property values corresponding to the respective values in table cells. Let’s look at the first row.

do1 rdf:type qb:Observation
do1 qb:dataSet d1
do1 timePeriod "2012-03-24T15:00:00"^^xsd:dateTime
do1 grossPrimaryProductivity "0.53997"^^xsd:double
do1 totalEcosystemRespiration "1.00844"^^xsd:double
do1 netEcosystemExchange "0.46847"^^xsd:double

And the second row.

do2 rdf:type qb:Observation
do2 qb:dataSet d1
do2 timePeriod "2012-03-24T15:30:00"^^xsd:dateTime
do2 grossPrimaryProductivity "0.57284"^^xsd:double
do2 totalEcosystemRespiration "1.08207"^^xsd:double
do2 netEcosystemExchange "0.50923"^^xsd:double
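Writing the observations by hand does not scale beyond a couple of rows. As a rough sketch, the mapping from table rows to dataset observations could be automated as follows. The function names and the do1, do2, ... naming scheme are assumptions for illustration. Note also that the xsd:dateTime lexical form requires seconds, so the sketch appends ":00" to the hh:mm values given in the table.

```python
HEADER = [
    "timePeriod",
    "grossPrimaryProductivity",
    "totalEcosystemRespiration",
    "netEcosystemExchange",
]
ROWS = [
    ["2012-03-24T15:00", "0.53997", "1.00844", "0.46847"],
    ["2012-03-24T15:30", "0.57284", "1.08207", "0.50923"],
]


def to_xsd_datetime(value):
    """Append seconds when the table value only has hours and minutes."""
    return value if value.count(":") == 2 else value + ":00"


def row_to_observation(index, row, dataset="d1"):
    """Return the statements for one table row (hypothetical helper)."""
    subject = f"do{index}"
    triples = [
        (subject, "rdf:type", "qb:Observation"),
        (subject, "qb:dataSet", dataset),
    ]
    for prop, value in zip(HEADER, row):
        if prop == "timePeriod":
            literal = f'"{to_xsd_datetime(value)}"^^xsd:dateTime'
        else:
            literal = f'"{value}"^^xsd:double'
        triples.append((subject, prop, literal))
    return triples


observation_triples = []
for index, row in enumerate(ROWS, start=1):
    observation_triples += row_to_observation(index, row)

for subject, predicate, obj in observation_triples:
    print(subject, predicate, obj)
```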

Notes

Using the QB vocabulary and RDF, we have created a machine readable and interpretable version of our tabular sample data. It consists of metadata for the table structure and data for each table row.
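One thing the structure metadata makes possible is checking observations against it. The sketch below assumes the required flags of dsd1 have been read into a plain dictionary and reports which required components an observation lacks. The name missing_components and the dictionary representation are assumptions for illustration; a real system would read both from the RDF data.

```python
# Component property -> required flag, as specified by dsd1 in this post.
REQUIRED = {
    "timePeriod": True,
    "grossPrimaryProductivity": True,
    "totalEcosystemRespiration": True,
    "netEcosystemExchange": True,
}


def missing_components(observation, required):
    """Return the sorted names of required components absent from an observation."""
    return sorted(
        prop
        for prop, is_required in required.items()
        if is_required and prop not in observation
    )


complete = {
    "timePeriod": "2012-03-24T15:00",
    "grossPrimaryProductivity": 0.53997,
    "totalEcosystemRespiration": 1.00844,
    "netEcosystemExchange": 0.46847,
}
# A deliberately incomplete observation, made up to show a failing check.
incomplete = {
    "timePeriod": "2012-03-24T15:30",
    "grossPrimaryProductivity": 0.57284,
}

print(missing_components(complete, REQUIRED))
print(missing_components(incomplete, REQUIRED))
```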

Compared to the SSN vocabulary for sensor observations, the QB vocabulary is better suited for multidimensional datasets. A sensor observation relates a single observation value to the sensor that made the observation and to the property of the feature for which the observation was made. In contrast, a dataset observation can relate to multiple values via several component properties.

The SSN and QB vocabularies are of interest to software systems that acquire data from environmental sensor networks and process such data. The vocabularies are of particular interest to ontology-based systems that build on RDF databases, such as the Stardog RDF database. In such systems, the SSN vocabulary can serve in the curation of sensor observations acquired from environmental sensor networks while the QB vocabulary can serve in the curation of processed dataset observations. Such processing includes the translation of sensor observations into dataset observations as well as the processing of input sets of dataset observations into output sets of dataset observations, i.e. dataset processing. Naturally, any data processing method performed on the original tabular (perhaps Excel) data can also be performed on a QB dataset. This includes simple aggregation functions as well as more involved computations on CO2 flux data.
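To make the point about simple aggregation functions concrete, here is a minimal sketch that averages a measure over dataset observations held as plain (subject, predicate, object) tuples. The in-memory representation and the helper measure_values are assumptions for illustration; against an RDF database the same aggregation would more likely be expressed in SPARQL.

```python
from statistics import mean

# Measure statements of the two sample observations, values as Python floats.
triples = [
    ("do1", "grossPrimaryProductivity", 0.53997),
    ("do1", "totalEcosystemRespiration", 1.00844),
    ("do1", "netEcosystemExchange", 0.46847),
    ("do2", "grossPrimaryProductivity", 0.57284),
    ("do2", "totalEcosystemRespiration", 1.08207),
    ("do2", "netEcosystemExchange", 0.50923),
]


def measure_values(statements, prop):
    """Collect the values of one measure property across all observations."""
    return [obj for _, predicate, obj in statements if predicate == prop]


mean_nee = mean(measure_values(triples, "netEcosystemExchange"))
print(f"mean NEE over {len(measure_values(triples, 'netEcosystemExchange'))} "
      f"observations: {mean_nee:.5f}")
```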

Exercise

Inspired by the exercise for sensor observations in the previous post, let’s run a few SPARQL queries against our QB dataset. First, point your browser at this SPARQL query engine. Then select Text for “Output”. If you now copy and paste the following SPARQL query and hit the “Get Results” button you will see a table that pretty much resembles the one above.

prefix qb: <http://purl.org/linked-data/cube#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix : <http://envi.uef.fi/saicos#>

select ?timePeriod ?gpp ?ter ?nee
from <http://markusstocker.com/assets/posts/dataset-observations/observations.rdf>
where {
  [ 
    rdf:type qb:Observation ;
    :timePeriod ?timePeriod ;
    :grossPrimaryProductivity ?gpp ;
    :totalEcosystemRespiration ?ter ;
    :netEcosystemExchange ?nee
  ]
}
order by asc(?timePeriod)

We can also inspect the data structure definition of our dataset.

prefix qb: <http://purl.org/linked-data/cube#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix : <http://envi.uef.fi/saicos#>

select ?property ?required ?order
from <http://markusstocker.com/assets/posts/dataset-observations/observations.rdf>
where {
  :d1 rdf:type qb:DataSet .
  :d1 qb:structure [ 
    qb:component [
      ?componentProperty ?property ;
      qb:componentRequired ?required ;
      qb:order ?order
    ]
  ] .
  ?property rdf:type qb:ComponentProperty .
}
order by asc(?order)
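For readers who want to see what this second query does step by step, the following Python sketch walks the same path over an in-memory list of tuples: from the dataset to its structure, to each component, and on to the component property with its required flag and order. It is not a SPARQL engine. The tuple representation, the use of Python booleans and integers for literals, and the function names are all assumptions for illustration.

```python
COMPONENTS = [
    ("cs1", "qb:dimension", "timePeriod"),
    ("cs2", "qb:measure", "grossPrimaryProductivity"),
    ("cs3", "qb:measure", "totalEcosystemRespiration"),
    ("cs4", "qb:measure", "netEcosystemExchange"),
]

# Rebuild the statements about d1 and dsd1 as plain tuples.
triples = [
    ("d1", "rdf:type", "qb:DataSet"),
    ("d1", "qb:structure", "dsd1"),
    ("dsd1", "rdf:type", "qb:DataStructureDefinition"),
]
for order, (cs, kind, prop) in enumerate(COMPONENTS, start=1):
    triples += [
        ("dsd1", "qb:component", cs),
        (cs, kind, prop),
        (prop, "rdf:type", "qb:ComponentProperty"),
        (cs, "qb:componentRequired", True),
        (cs, "qb:order", order),
    ]


def objects(statements, subject, predicate):
    """Return all objects of statements with the given subject and predicate."""
    return [o for s, p, o in statements if s == subject and p == predicate]


def describe_structure(statements, dataset):
    """Return (property, required, order) rows sorted by order."""
    if (dataset, "rdf:type", "qb:DataSet") not in statements:
        return []
    component_properties = {
        s for s, p, o in statements
        if p == "rdf:type" and o == "qb:ComponentProperty"
    }
    rows = []
    for dsd in objects(statements, dataset, "qb:structure"):
        for cs in objects(statements, dsd, "qb:component"):
            required = objects(statements, cs, "qb:componentRequired")[0]
            order = objects(statements, cs, "qb:order")[0]
            # Any predicate counts, as with ?componentProperty in the query.
            for s, _, o in statements:
                if s == cs and o in component_properties:
                    rows.append((o, required, order))
    return sorted(rows, key=lambda row: row[2])


for row in describe_structure(triples, "d1"):
    print(row)
```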

References

[1] Shurpali, Narasinha J., Hyvönen, Niina P., Huttunen, Jari T., Clement, Robert J., Reichstein, Markus, Nykänen, Hannu, Biasi, Christina, and Martikainen, Pertti J. (2009). Cultivation of a perennial grass for bioenergy on a boreal organic soil – carbon sink or source? GCB Bioenergy, 1(1):35-50. Blackwell Publishing Ltd. doi:10.1111/j.1757-1707.2009.01003.x

This post is part of a series. Previous posts discussed RDF, RDFS and OWL, the extraction of metadata about sensing devices from various documents, and the representation of sensor data using the SSN ontology. The next one is about data interpretation and the composition of information acquired from data to knowledge about situations observed by environmental monitoring systems.