Thursday, June 9, 2011

DSPL: Dataset Publishing Language



DSPL stands for Dataset Publishing Language. Datasets described in DSPL can be imported into the Google Public Data Explorer, a tool that allows for rich, visual exploration of the data.


This tutorial provides a step-by-step example of how to prepare a basic DSPL dataset.

A DSPL dataset is a bundle that contains an XML file and a set of CSV files. The CSV files are simple tables containing the data of the dataset. The XML file describes the metadata of the dataset, including informational metadata like descriptions of measures, as well as structural metadata like references between tables. The metadata lets non-expert users explore and visualize your data.
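To make the shape of a bundle concrete, here is a minimal Python sketch that packs one XML file and two CSV files into an archive and lists them back. The file names, the dummy contents, and the choice of a zip archive as the container are assumptions made for illustration; the text above only states that a bundle holds one XML file and a set of CSV files.

```python
import io
import zipfile

# Hypothetical file names and dummy contents, for illustration only.
FILES = {
    "dataset.xml": "<dspl/>",
    "countries.csv": "country,name\nUS,United States\n",
    "country_slice.csv": "country,year,population\nUS,2010,100\n",
}


def build_bundle(files):
    """Pack a {file name: text} mapping into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()


def summarize_bundle(data):
    """Return the sorted XML and CSV file names found in a zipped bundle."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
    xml_names = sorted(n for n in names if n.endswith(".xml"))
    csv_names = sorted(n for n in names if n.endswith(".csv"))
    return xml_names, csv_names


bundle = build_bundle(FILES)
xml_names, csv_names = summarize_bundle(bundle)
```

A check like this is a quick way to confirm that a bundle contains exactly one metadata file before it is imported.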

The only prerequisite for this tutorial is a good understanding of XML. Some familiarity with simple database concepts (e.g., tables, primary keys) may help, but it is not required. For reference, the completed XML file and the complete dataset bundle associated with this tutorial are also available for review.


Before starting to create our dataset, here is a high-level overview of what a DSPL dataset contains:
General information: About the dataset
Concepts: Definitions of "things" that appear in the dataset (e.g., countries, unemployment rate, gender, etc.)
Slices: Combinations of concepts for which there are data
Tables: Data for concepts and slices. Concept tables hold enumerations and slice tables hold statistical data
Topics: Used to organize the concepts of the dataset in a meaningful hierarchy through labeling


To illustrate these rather abstract notions, consider the dataset (with dummy data) used throughout this tutorial: statistical time series for unemployment and population by country, and population by gender for US states.





This example dataset defines the following concepts:
country
gender
population
state
unemployment rate
year


Concepts that are categorical, such as state, are associated with concept tables, which enumerate all their possible values (California, Arizona, etc.). Concepts may have additional columns for properties such as the name or the country of a state.
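As a rough illustration of a concept table, the Python sketch below loads a small CSV for the state concept and indexes each value by its identifier. The column names (state, name, country), the state codes, and the helper load_concept_table are hypothetical; only the idea of one row per possible value, with extra columns for properties, comes from the text.

```python
import csv
import io

# Hypothetical concept table for the "state" concept (dummy data).
STATES_CSV = """state,name,country
CA,California,US
AZ,Arizona,US
"""


def load_concept_table(text, key_column):
    """Map each value of a concept to its remaining property columns."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = row.pop(key_column)
        if key in table:
            # A concept table enumerates values, so each should appear once.
            raise ValueError(f"duplicate concept value: {key}")
        table[key] = dict(row)
    return table


states = load_concept_table(STATES_CSV, "state")
```

Here states["CA"] holds the name and country properties for California, which is how the extra columns attach properties to each enumerated value.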


Slices define each combination of concepts for which there is statistical data in the dataset. A slice contains dimensions and metrics. In the above picture, the dimensions are blue and the metrics are orange. In this example, the slice gender_country_slice has data for the metric population and the dimensions country, year and gender. Another slice, called country_slice, gives total yearly population numbers (metric) for countries.


In addition to dimensions and metrics, slices also reference tables, which contain the actual data.
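The relationship between a slice and its table can be sketched in Python as follows. The two slices mirror the ones named above; the table names, the dummy CSV rows, and the helper check_slice_table are assumptions for illustration and are not part of DSPL itself.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Slice:
    id: str
    dimensions: tuple
    metrics: tuple
    table: str  # name of the table holding the data (hypothetical names below)


SLICES = [
    Slice("gender_country_slice", ("country", "year", "gender"),
          ("population",), "gender_country_slice_table"),
    Slice("country_slice", ("country", "year"),
          ("population",), "country_slice_table"),
]

# Dummy data for the country_slice table.
COUNTRY_SLICE_CSV = """country,year,population
US,2010,300
US,2011,310
"""


def check_slice_table(slice_, csv_text):
    """Verify the table has a column per dimension and metric; return row count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = set(slice_.dimensions) | set(slice_.metrics)
    missing = expected - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"{slice_.id}: missing columns {sorted(missing)}")
    return sum(1 for _ in reader)


row_count = check_slice_table(SLICES[1], COUNTRY_SLICE_CSV)
```

Passing the same CSV to gender_country_slice raises a ValueError, because the table has no gender column for that slice's third dimension.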


Let's now walk step-by-step through the creation of such a dataset in DSPL.
To get started, we need to create an XML file for our dataset. The dataset description starts with a top-level <dspl> element. Its targetNamespace attribute contains a URI that uniquely identifies this dataset. The dataset's namespace is especially important when publishing the dataset, as it will be the global identifier of your dataset and the means for others to refer to it.
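As a hedged illustration, the Python sketch below parses a minimal dataset description and reads its targetNamespace. The schema namespace http://schemas.google.com/dspl/2010 and the example URI http://www.stats-bureau.com/mystats do not appear in this post; they are assumptions here and should be checked against the DSPL documentation. The helper read_target_namespace is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed DSPL schema namespace; verify against the DSPL documentation.
DSPL_NS = "http://schemas.google.com/dspl/2010"

# Minimal dataset description with an assumed example targetNamespace.
DATASET_XML = """<?xml version="1.0" encoding="UTF-8"?>
<dspl targetNamespace="http://www.stats-bureau.com/mystats"
      xmlns="http://schemas.google.com/dspl/2010">
</dspl>
"""


def read_target_namespace(xml_text):
    """Return the targetNamespace of a dataset description, or None if omitted."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag != f"{{{DSPL_NS}}}dspl":
        raise ValueError(f"unexpected top-level element: {root.tag}")
    return root.get("targetNamespace")


namespace = read_target_namespace(DATASET_XML)
```

The function returns None when the attribute is absent, which makes it easy to tell whether a description sets its own identifier.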



Note that the targetNamespace attribute may be omitted. In this case a unique namespace is automatically generated when the dataset is imported.