
KUDAF and Metadata standards

The KUDAF initiative relates to two different Metadata standards:

DCAT-AP-NO (Data Catalog Vocabulary)

This is the Norwegian implementation of the international DCAT standard. At this level one can discover different data sources in the Felles datakatalog, for example.

Among the many different concepts included in this standard, three important concepts are relevant to the Kudaf architecture:

  1. A Catalog dcat:Catalog is defined as "a curated collection of metadata about resources (e.g., datasets and data services in the context of a data catalog)".

  2. A Dataset dcat:Dataset is defined as "a collection of data, published or curated by a single agent, and available for access or download in one or more representations". This is a broad definition: it could fit a database, a table within a database, a view on a database, or the results of a query. There is also the matter of representation: a dataset could be a JSON or CSV file containing data.

  3. A Data Service dcat:DataService is defined as "a collection of operations that provides access to one or more datasets or data processing functions". Here we can have, for example, an API.
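As a rough illustration of how these three concepts relate, the sketch below models them as plain Python dataclasses. It is not the DCAT-AP-NO schema itself: the field names (`representations`, `endpoint_url`, `serves`) and the example titles and URL are assumptions chosen only for readability.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """dcat:Dataset: a collection of data in one or more representations."""
    title: str
    representations: list[str] = field(default_factory=list)  # e.g. "CSV", "JSON"


@dataclass
class DataService:
    """dcat:DataService: operations giving access to one or more datasets."""
    title: str
    endpoint_url: str  # hypothetical URL, for illustration only
    serves: list[Dataset] = field(default_factory=list)


@dataclass
class Catalog:
    """dcat:Catalog: a curated collection of metadata about resources."""
    title: str
    datasets: list[Dataset] = field(default_factory=list)
    services: list[DataService] = field(default_factory=list)


# Hypothetical example: one dataset, exposed through one API, listed in one catalog.
marital = Dataset("Marital status", ["CSV", "JSON"])
api = DataService("Example API", "https://example.org/api", [marital])
catalog = Catalog("Example catalog", [marital], [api])
```

The same Dataset object appears both in the catalog and behind the data service, which mirrors the idea that a catalog only describes resources while a service provides access to them.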

    DCAT

RAIRD Information Model

This is a Norwegian implementation of the international GSIM (General Statistical Information Model) standard. Designed to address the need to describe statistical information, this framework "provides a set of standardised, consistently described information objects, which can be used as inputs and outputs in the design and the production of statistics". It was created to make Statistics Norway data available to researchers, for example through the microdata.no portal and IDE.

This level is where the Kudaf initiative concentrates, because it enables the description (through metadata) of all the data made publicly available. The important concepts are:

  1. Variable: Refers to the metadata describing a minimalistic unit of data that is useful for data analysis purposes. In order to make it minimalistic and flexible for data analysis purposes we use a datum-based data structure (see description below).
  2. Unit identifier: The unique identifier which singles out this unit of data.

Datum-based data structure

A multi-variable source dataset, such as is commonly available from a data source, would be structured as GSIM Unit Data:

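| CASE_ID | DOB        | MAR_STAT | GENDER | DATE_MAR   | DATE_SEP | DATE_DIV |
|---------|------------|----------|--------|------------|----------|----------|
| 0937    | 1971-05-03 | M        | 1      | 2003-08-04 | -        | -        |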

This is the typical database structure that we know well, where the CASE_ID field is the identifier (primary key) field for the whole row. The value found in a cell of that row would constitute a single DATUM, i.e. the value that populates a Data Point.

Figure: datum concept group

In order to provide maximum flexibility to the data researcher, we could decompose this multi-variable dataset into stand-alone, SINGLE-VARIABLE DATASETS expressed in a datum-based model. For example, the above information on the Marital Status alone could be expressed as:

| IDENTIFIER | VAR_REF  | VALUE | START_DATE | END_DATE |
|------------|----------|-------|------------|----------|
| 0937       | MAR_STAT | M     | 2003-08-04 | -        |

In fact we could model the entire multi-variable table from above according to the datum-based approach like this:

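| IDENTIFIER | VAR_REF  | VALUE      | START_DATE | END_DATE |
|------------|----------|------------|------------|----------|
| 0937       | DOB      | 1971-05-03 | 1971-05-03 | -        |
| 0937       | MAR_STAT | M          | 2003-08-04 | -        |
| 0937       | GENDER   | 1          | 1971-05-03 | -        |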

Each such datum constitutes the minimalistic unit of data we were referring to above.
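The decomposition from a multi-variable row into datum records can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the Kudaf implementation: the `Datum` type, the `decompose` function and the `START_DATE_COLUMN` mapping are hypothetical names, and the rule for choosing each START_DATE is inferred from the example rows above. How DATE_SEP and DATE_DIV would feed END_DATE is not specified in the text, so END_DATE is left open here.

```python
from typing import NamedTuple, Optional


class Datum(NamedTuple):
    """One row of the datum-based structure."""
    identifier: str
    var_ref: str
    value: Optional[str]
    start_date: Optional[str]
    end_date: Optional[str]


# Assumption: the source column that supplies START_DATE for each variable,
# inferred from the example tables (DOB and GENDER start at the birth date,
# MAR_STAT starts at the marriage date).
START_DATE_COLUMN = {"DOB": "DOB", "MAR_STAT": "DATE_MAR", "GENDER": "DOB"}


def decompose(row: dict[str, Optional[str]], id_column: str = "CASE_ID") -> list[Datum]:
    """Split one multi-variable row into one Datum per variable."""
    return [
        # END_DATE is left as None: the text does not define how it is derived.
        Datum(row[id_column], var, row[var], row.get(start_col), None)
        for var, start_col in START_DATE_COLUMN.items()
    ]


# The example row from the multi-variable table ("-" represented as None).
row = {
    "CASE_ID": "0937", "DOB": "1971-05-03", "MAR_STAT": "M", "GENDER": "1",
    "DATE_MAR": "2003-08-04", "DATE_SEP": None, "DATE_DIV": None,
}
data = decompose(row)
for datum in data:
    print(datum)
```

Each resulting record carries its own identifier and validity period, so any single variable can be extracted, filtered or joined without touching the others.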

To describe the data contained within the datum, we define a Variable. We thus use the simple term Variable to refer to a Single-Variable Dataset.

A VARIABLE combines the meaning of a Concept with a Unit Type, to define the characteristic that is to be measured.
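One way to picture this combination is a small record type that pairs a Concept with a Unit Type. The sketch below is illustrative only: the field names, the `describe` helper and the example values ("Marital status", "Person") are assumptions and do not come from the RAIRD or Kudaf metadata schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """Metadata for a single-variable dataset: a Concept applied to a Unit Type."""
    name: str       # matches VAR_REF in the datum-based structure
    concept: str    # what is measured (hypothetical example value below)
    unit_type: str  # what kind of unit it is measured on (hypothetical)

    def describe(self) -> str:
        """Return a human-readable description of the measured characteristic."""
        return f"{self.concept} of a {self.unit_type}"


mar_stat = Variable(name="MAR_STAT", concept="Marital status", unit_type="Person")
print(mar_stat.describe())
```

Keeping the Concept and the Unit Type as separate fields means the same concept could be reused with a different unit type without redefining it.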

Finally, the metadata models for the Variable and for the Dataset are linked as shown below:

Simple Kudaf Metadata Model