Author Archives: gregg

Implementing CSV on the Web

As I blogged about before, I’ve implemented the current drafts of CSV on the Web in the rdf-tabular gem. The gem is available from the rdf-tabular repo, is in the public domain (Unlicense), and is freely usable by anyone wishing to get a start on their own implementation. For those wishing to take an incremental approach, this post describes the basic workings of the gem, highlights more advanced cases necessary to pass the Test Suite, and attempts to provide some insight into the process of implementing the specifications.

CSVW – in a nutshell

The basic purpose of the CSVW specifications is to define an informed process for transforming CSVs into an annotated data model, and to use this model as the basis for creating RDF or JSON. At a minimum, this means assuming that the first row of a CSV is a header row containing titles for cells in that CSV, and that each row constitutes a record with properties based on the column titles and values from each cell. We’ll use the canonical tree-ops example from the Model for Tabular Data:

GID,On Street,Species,Trim Cycle,Inventory Date
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,10/18/2010
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,6/2/2010

This example has a header row and two data rows. The simple transformation of this to RDF without any external metadata would yield the following (in minimal mode):

@base <http://w3c.github.io/csvw/tests/tree-ops.csv> .

[
  <#GID> "2";
  <#Inventory%20Date> "6/2/2010";
  <#On%20Street> "EMERSON ST";
  <#Species> "Liquidambar styraciflua";
  <#Trim%20Cycle> "Large Tree Routine Prune"
] .

[
  <#GID> "1";
  <#Inventory%20Date> "10/18/2010";
  <#On%20Street> "ADDISON AV";
  <#Species> "Celtis australis";
  <#Trim%20Cycle> "Large Tree Routine Prune"
] .

This is minimally useful, but we can use this as a walk-through of how the rdf-tabular gem creates this.

The first step is to retrieve the document from its URL: http://w3c.github.io/csvw/tests/tree-ops.csv. We then look for metadata associated with the file to inform the creation of the annotated table; this information might come from the user specifying the location of metadata separately, from an HTTP Link header, by looking at http://w3c.github.io/csvw/tests/tree-ops.csv-metadata.json for file-specific metadata, or at http://w3c.github.io/csvw/tests/metadata.json for directory-specific metadata. In this case, there is none, so we construct embedded metadata from just the header row. The process for locating metadata is described in Creating Annotated Tables, and creating embedded metadata is described in Parsing Tabular Data. Basically, we’re looking to create a table description with a schema describing each column, resulting in the following:

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "http://w3c.github.io/csvw/tests/tree-ops.csv",
  "tableSchema": {
    "columns": [
      {"titles": "GID"},
      {"titles": "On Street"},
      {"titles": "Species"},
      {"titles": "Trim Cycle"},
      {"titles": "Inventory Date"}
    ]
  }
}

By following the rules in the Metadata Format, we then create annotations on an annotated table (an abstract model, shown in arbitrary JSON):

{
  "@type": "AnnotatedTableGroup",
  "tables": [
    {
      "@type": "AnnotatedTable",
      "url": "http://w3c.github.io/csvw/tests/tree-ops.csv",
      "columns": [
        {
          "@id": "http://w3c.github.io/csvw/tests/tree-ops.csv#col=1",
          "@type": "Column",
          "number": 1,
          "sourceNumber": 1,
          "cells": [],
          "name": "GID",
          "titles": {"und": ["GID"]}
        },
        {
          "@id": "http://w3c.github.io/csvw/tests/tree-ops.csv#col=2",
          "@type": "Column",
          "number": 2,
          "sourceNumber": 2,
          "cells": [],
          "name": "On%20Street",
          "titles": {"und": ["On Street"]}
        },
        {
          "@id": "http://w3c.github.io/csvw/tests/tree-ops.csv#col=3",
          "@type": "Column",
          "number": 3,
          "sourceNumber": 3,
          "cells": [],
          "name": "Species",
          "titles": {"und": ["Species"]}
        },
        {
          "@id": "http://w3c.github.io/csvw/tests/tree-ops.csv#col=4",
          "@type": "Column",
          "number": 4,
          "sourceNumber": 4,
          "cells": [],
          "name": "Trim%20Cycle",
          "titles": {"und": ["Trim Cycle"]}
        },
        {
          "@id": "http://w3c.github.io/csvw/tests/tree-ops.csv#col=5",
          "@type": "Column",
          "number": 5,
          "sourceNumber": 5,
          "cells": [],
          "name": "Inventory%20Date",
          "titles": {"und": ["Inventory Date"]}
        }
      ],
      "rows": []
    }
  ]
}

Here we gather some metadata based on the row and column numbers (logical and actual, which are the same here) and create a default for the column name from titles. Also, note that titles is expanded to a natural language property, which is basically a JSON-LD Language Map where und is used when there is no language. The column name is percent encoded so that it can be used in a URI template.
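
As a simple illustration (a sketch, not the gem’s exact encoding routine, which follows the rules in the Metadata Format), deriving a default column name from its first title might look like:

require 'uri'

# Derive a default column name from the first title, percent-encoding it so it is
# safe to use in a URI template (spaces become %20, etc.).
def default_column_name(titles)
  first_title = Array(titles).first
  URI.encode_www_form_component(first_title).gsub('+', '%20')
end

default_column_name("On Street")      #=> "On%20Street"
default_column_name("Inventory Date") #=> "Inventory%20Date"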

Note that the rows and cells values are empty, as we haven’t actually processed any rows from the input yet.

The rdf-tabular gem implements the RDF::Reader pattern, which takes care of much of the process of opening the file and getting useful metadata. In the case of the rdf-tabular gem, this includes the steps described in Creating Annotated Tables, which involve recursive invocations of the reader; nominally, it yields a reader instance which implements the RDF::Enumerable pattern. In particular, the reader implements an #each method to yield each RDF Statement (triple); this is where the work actually happens. A call looks like the following:

RDF::Tabular::Reader.open("http://w3c.github.io/csvw/tests/tree-ops.csv") do |reader|
  reader.each do |statement|
    puts statement.inspect
  end
end

Basically, the job of the reader is to create the abstract tabular data model, use it to read each row of the table(s) described in the model, and use that to generate RDF triples (it can also be used to generate JSON output without going through RDF, but that’s another story).

The rdf-tabular gem implements RDF::Tabular::Metadata, with subclasses for the different kinds of metadata we need. This provides the #each_row method, which given an input file yields each row. For the Ruby implementation, we use the CSV library and use dialect information to set parsing defaults. The gist of the implementation looks like the following:

def each_row(input)
  csv = ::CSV.new(input, csv_options)
  # Skip skipRows and headerRowCount
  number, skipped = 0, (dialect.skipRows.to_i + dialect.headerRowCount)
  (1..skipped).each {csv.shift}
  csv.each do |data|
    number += 1
    yield(Row.new(data, self, number, number + skipped))
  end
end

A Row then abstracts information for each table row and provides the cells for that row:

# Wraps each resulting row
class Row
  attr_reader :values

  # Class for returning values
  Cell = Struct.new(:table, :column, :row, :stringValue, :value, :errors)

  def initialize(row, metadata, number, source_number)
    @values = []
    skipColumns = metadata.dialect.skipColumns.to_i
    columns = metadata.tableSchema.columns
    row.each_with_index do |value, index|
      next if index < skipColumns
      column = columns[index - skipColumns]
      @values << cell = Cell.new(metadata, column, self, value)
    end
  end
end

There’s more to this in the actual implementation, of course, but this handles a simple value.

Now we can implement RDF::Tabular::Reader#each_statement:

def each_statement(&block)
  metadata.each_row(input) do |row|
    default_cell_subject = RDF::Node.new
    row.values.each_with_index do |cell, index|
      propertyUrl = RDF::URI("#{metadata.url}##{cell.column.name}")
      yield RDF::Statement(default_cell_subject, propertyUrl, cell.value)
    end
  end
end

That’s pretty much the basis of a Ruby implementation. There’s more work to do in #each_statement, as it’s initially invoked with a TableGroup, which recursively invokes it again for each Table, which then calls again for the actual CSV, but that’s all setup. There’s also work in RDF::Tabular::Reader#initialize to find the metadata, ensure that it is compatible with the actual tables, and so forth.

Fleshing out with more details

So, this implements a basic reader interface from a CSV using the abstract tabular data model, but the output’s not too interesting. What if we want to make the data richer:

  • Give unique identifiers (subjects) to the cells in a row
  • Use subjects for different cells
  • Define property URIs for each cell
  • Assign and match cell string values to datatypes
  • Parse microformats within a given cell

For that we need to define a Metadata file.

Defining Metadata

A Metadata file is a JSON-LD document (really, it has the structure of a JSON-LD document but is parsed as plain JSON) which allows us to define properties on metadata declarations that relate directly to the abstract tabular data model. For example, let’s look at the metadata description for the tree-ops example: tree-ops.csv-metadata.json:

{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "url": "tree-ops.csv",
  "dc:title": "Tree Operations",
  "dcat:keyword": ["tree", "street", "maintenance"],
  "dc:publisher": {
    "schema:name": "Example Municipality",
    "schema:url": {"@id": "http://example.org"}
  },
  "dc:license": {"@id": "http://opendefinition.org/licenses/cc-by/"},
  "dc:modified": {"@value": "2010-12-31", "@type": "xsd:date"},
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": ["GID", "Generic Identifier"],
      "dc:description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "required": true
    }, {
      "name": "on_street",
      "titles": "On Street",
      "dc:description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "titles": "Species",
      "dc:description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "dc:description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "dc:description": "The date of the operation that was performed.",
      "datatype": {"base": "date", "format": "M/d/yyyy"}
    }],
    "primaryKey": "GID",
    "aboutUrl": "#gid-{GID}"
  }
}

Here we have a couple of different things going on. Note the following:

{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "url": "tree-ops.csv",
  "dc:title": "Tree Operations",
  "dcat:keyword": ["tree", "street", "maintenance"],
  "dc:publisher": {
    "schema:name": "Example Municipality",
    "schema:url": {"@id": "http://example.org"}
  },
  "dc:license": {"@id": "http://opendefinition.org/licenses/cc-by/"},
  "dc:modified": {"@value": "2010-12-31", "@type": "xsd:date"},
  ...
}

The prefixed name properties are common properties, basically just JSON-LD that’s inserted into the model to define annotations on the model. In this case, they’re defined on a Table, so they annotate that model. As it is JSON-LD, the string values take the @language defined in the context. CSVW uses a dialect of JSON-LD which places some restrictions on what can go here. Basically, nothing more can go into the @context besides @language and @base; it also must use http://www.w3.org/ns/csvw, and only the terms and prefixes defined within that context can be used, along with absolute IRIs.
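
A small sketch (not the gem’s actual check) of enforcing that restriction on @context might look like this:

# Returns true if the @context references the CSVW context and adds at most
# @language and/or @base.
def valid_csvw_context?(ctx)
  entries = ctx.is_a?(Array) ? ctx : [ctx]
  entries.include?("http://www.w3.org/ns/csvw") &&
    entries.all? do |entry|
      case entry
      when "http://www.w3.org/ns/csvw" then true
      when Hash then (entry.keys - %w(@language @base)).empty?
      else false
      end
    end
end

valid_csvw_context?(["http://www.w3.org/ns/csvw", {"@language" => "en"}]) #=> true
valid_csvw_context?(["http://www.w3.org/ns/csvw", {"@vocab" => "http://example.org/"}]) #=> false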

{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "url": "tree-ops.csv",
  ...
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": ["GID", "Generic Identifier"],
      "dc:description": "An identifier for the operation on a tree.",
      "datatype": "string",
      "required": true
    }, {
      "name": "on_street",
      "titles": "On Street",
      "dc:description": "The street that the tree is on.",
      "datatype": "string"
    }, {
      "name": "species",
      "titles": "Species",
      "dc:description": "The species of the tree.",
      "datatype": "string"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "dc:description": "The operation performed on the tree.",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "dc:description": "The date of the operation that was performed.",
      "datatype": {"base": "date", "format": "M/d/yyyy"}
    }],
    "primaryKey": "GID",
    "aboutUrl": "#gid-{GID}"
  }
}

The tableSchema property tells us that this is a Table description; we could also have added a "@type": "Table" property to make this explicit, but it’s not necessary. Metadata always starts with either a Table description or a TableGroup description.

We define an explicit name property for the first column. If we didn’t, it would take the first value from titles. The column has a common property, which is presently not used in the transformation but exists in the annotated tabular data model. It also declares the datatype to be string, which is a synonym for xsd:string, and that a value is required, meaning that it is considered an error if a cell has an empty string value (or one matching that defined using the null annotation). Note that datatype is an inherited property, meaning that it could have been defined on the Table, Schema or Column and would be in scope for all cells based on the inheritance model.
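
To illustrate the inheritance model (a simplified sketch, not the gem’s actual resolution logic), an inherited property such as datatype is looked up by walking from the column to its schema and then to the table description:

# Simplified sketch of inherited-property lookup: column, then schema, then table.
def inherited_property(column, schema, table, prop)
  column[prop] || schema[prop] || table[prop]
end

table  = {"null" => [""]}                      # table-level inherited properties
schema = {"datatype" => "string"}              # schema-level default datatype
column = {"name" => "GID", "required" => true}

inherited_property(column, schema, table, "datatype") #=> "string"
inherited_property(column, schema, table, "null")     #=> [""]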

The last column uses a complex datatype: It is based on xsd:date and uses a format string to match string values and map them onto the datatype. This allows a date of the form 4/17/2015 to be interpreted as "2015-04-17"^^xsd:date by using date field symbols as defined in UAX35.
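
For illustration, here is a minimal sketch (not the gem’s actual implementation, which handles many more UAX35 symbols) of matching a cell’s string value against such a format:

require 'date'

# Map a few UAX35 date field symbols onto Ruby's strptime directives; the real
# mapping covers many more symbols.
UAX35_TO_STRPTIME = {"yyyy" => "%Y", "M" => "%m", "d" => "%d"}.freeze

def parse_date(value, format)
  strptime_format = format.gsub(/yyyy|M|d/) { |sym| UAX35_TO_STRPTIME[sym] }
  Date.strptime(value, strptime_format).iso8601
rescue ArgumentError
  nil # the value does not match the format; the cell gets an error annotation
end

parse_date("10/18/2010", "M/d/yyyy") #=> "2010-10-18"
parse_date("6/2/2010", "M/d/yyyy")   #=> "2010-06-02"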

All of these map pretty much as you would expect onto the annotated tabular data model:

{
  "@type": "AnnotatedTableGroup",
  "tables": [
    {
      "@type": "AnnotatedTable",
      "url": "http://w3c.github.io/csvw/tests/test011/tree-ops.csv",
      "dc:title": {"@value": "Tree Operations","@language": "en"},
      "dcat:keyword": [
        {"@value": "tree","@language": "en"},
        {"@value": "street","@language": "en"},
        {"@value": "maintenance","@language": "en"}
      ],
      "dc:publisher": {
        "schema:name": {"@value": "Example Municipality","@language": "en"},
        "schema:url": {"@id": "http://example.org"}
      },
      "dc:license": {"@id": "http://opendefinition.org/licenses/cc-by/"},
      "dc:modified": {"@value": "2010-12-31","@type": "xsd:date"},
      "columns": [
        {
          "@id": "http://w3c.github.io/csvw/tests/test011/tree-ops.csv#col=1",
          "@type": "Column",
          "number": 1,
          "sourceNumber": 1,
          "cells": [],
          "name": "GID",
          "titles": {"en": ["GID", "Generic Identifier"]},
          "dc:description": {"@value": "An identifier for the operation on a tree.", "@language": "en"},
          "datatype": {"base": "string"},
          "required": true
        },
        {
          "@id": "http://w3c.github.io/csvw/tests/test011/tree-ops.csv#col=2",
          "@type": "Column",
          "number": 2,
          "sourceNumber": 2,
          "cells": [],
          "name": "on_street",
          "titles": {"en": ["On Street"]},
          "dc:description": {"@value": "The street that the tree is on.","@language": "en"},
          "datatype": {"base": "string"}
        },
        {
          "@id": "http://w3c.github.io/csvw/tests/test011/tree-ops.csv#col=3",
          "@type": "Column",
          "number": 3,
          "sourceNumber": 3,
          "cells": [],
          "name": "species",
          "titles": {"en": ["Species"]},
          "dc:description": {"@value": "The species of the tree.","@language": "en"},
          "datatype": {"base": "string"}
        },
        {
          "@id": "http://w3c.github.io/csvw/tests/test011/tree-ops.csv#col=4",
          "@type": "Column",
          "number": 4,
          "sourceNumber": 4,
          "cells": [],
          "name": "trim_cycle",
          "titles": {"en": ["Trim Cycle"]},
          "dc:description": {"@value": "The operation performed on the tree.","@language": "en"},
          "datatype": {"base": "string"}
        },
        {
          "@id": "http://w3c.github.io/csvw/tests/test011/tree-ops.csv#col=5",
          "@type": "Column",
          "number": 5,
          "sourceNumber": 5,
          "cells": [],
          "name": "inventory_date",
          "titles": {"en": ["Inventory Date"]},
          "dc:description": {"@value": "The date of the operation that was performed.","@language": "en"},
          "datatype": {"base": "date", "format": "M/d/yyyy"}
        }
      ],
      "rows": []
    }
  ]
}

Note that the common properties have been expanded and the @context simplified to remove @language. Also datatype values have been normalized to the expanded form. This happens through the process of Normalization.

Normalization

Normalization places metadata into a consistent format: compact values are expanded; values which may be an Array are made into an Array; link properties have their URLs resolved against the base of the metadata; and object properties given as a URL string are dereferenced and replaced with the content of the resource they reference.
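
As a rough sketch (simplified from what the gem actually does), a few of these normalization steps look like this:

require 'uri'

# Expand a compact datatype and turn titles into a language map.
def normalize_column(column, language = "und")
  column = column.dup
  column["datatype"] = {"base" => column["datatype"]} if column["datatype"].is_a?(String)
  column["titles"]   = {language => Array(column["titles"])} if column["titles"]
  column
end

# Resolve the table URL against the metadata's base.
def normalize_table_url(table, base)
  table.merge("url" => URI.join(base, table["url"]).to_s)
end

normalize_column({"titles" => "GID", "datatype" => "string"})
#=> {"titles"=>{"und"=>["GID"]}, "datatype"=>{"base"=>"string"}}
normalize_table_url({"url" => "tree-ops.csv"}, "http://w3c.github.io/csvw/tests/")
#=> {"url"=>"http://w3c.github.io/csvw/tests/tree-ops.csv"}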

Ensuring Metadata Compatibility (was Merging Metadata)

The Locating Metadata section describes the various places metadata files may be found. Creating Annotated Tables defines the process of creating the final metadata when starting with either a Metadata File or a CSV, and this calls for ensuring Metadata Compatibility. Given metadata, it can be used to find one or more tabular data files. These files are considered compatible with the metadata if they are referenced from the metadata (i.e., there is a Table Description referencing that particular tabular data file), and the columns in the Table Description schema match the columns in the tabular data file, comparing titles or names from the column metadata with the title of the tabular data file column, if it has one. This is necessary to ensure that the CSV file matches the schema described in the metadata.
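
A minimal sketch of the column-comparison part of that check (assuming titles are simple strings or arrays, not full language maps) might look like:

# Each non-virtual schema column must match the corresponding CSV header title,
# comparing against the column's name or any of its titles.
def columns_compatible?(schema_columns, csv_titles)
  schema_columns.zip(csv_titles).all? do |column, title|
    next true if title.nil? || column["virtual"]
    candidates = [column["name"], *Array(column["titles"])].compact
    candidates.empty? || candidates.include?(title)
  end
end

columns_compatible?(
  [{"name" => "GID", "titles" => ["GID", "Generic Identifier"]},
   {"name" => "on_street", "titles" => "On Street"}],
  ["GID", "On Street"]
) #=> true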

URI Templates

In the tree-ops-virtual.json example, URI template properties were defined:

{
  "url": "tree-ops.csv",
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "tableSchema": {
    "columns": [{
      "name": "GID",
      "titles": "GID",
      "datatype": "string",
      "propertyUrl": "schema:url",
      "valueUrl": "#gid-{GID}"
    }, {
      "name": "on_street",
      "titles": "On Street",
      "datatype": "string",
      "aboutUrl": "#location-{GID}",
      "propertyUrl": "schema:streetAddress"
    }, {
      "name": "species",
      "titles": "Species",
      "datatype": "string",
      "propertyUrl": "schema:name"
    }, {
      "name": "trim_cycle",
      "titles": "Trim Cycle",
      "datatype": "string"
    }, {
      "name": "inventory_date",
      "titles": "Inventory Date",
      "datatype": {"base": "date", "format": "M/d/yyyy"},
      "aboutUrl": "#event-{inventory_date}",
      "propertyUrl": "schema:startDate"
    }, {
      "propertyUrl": "schema:event",
      "valueUrl": "#event-{inventory_date}",
      "virtual": true
    }, {
      "propertyUrl": "schema:location",
      "valueUrl": "#location-{GID}",
      "virtual": true
    }, {
      "aboutUrl": "#location-{GID}",
      "propertyUrl": "rdf:type",
      "valueUrl": "schema:PostalAddress",
      "virtual": true
    }],
    "aboutUrl": "#gid-{GID}"
  }
}

This introduces several new concepts:

  • The Schema has an aboutUrl property: “#gid-{GID}”. This is a URI Template, where GID acts as a variable, taking the value of the cell in the GID column to construct a URI. In this case it constructs values such as <http://w3c.github.io/csvw/examples/tree-ops.csv#gid-1>, because the Table url is tree-ops.csv, which is a URL relative to the location of the metadata file. In the first row, the value of the GID column is 1, so that is substituted to create “#gid-1”, which is then resolved against url. This aboutUrl then defines the default subject for all cells within that row.

The first column has “propertyUrl”: “schema:url”, which turns into the absolute URL http://schema.org/url and is used as the predicate for that cell. As the column has a valueUrl (“valueUrl”: “#gid-{GID}”), that is expanded and used as the object for that cell. Thus, the first cell of the first row would result in the following triple:

<#gid-1> schema:url <#gid-1> .

(relative to the URL file location).

The second column has its own aboutUrl (“aboutUrl”: “#location-{GID}”), meaning that the subject for triples from this column is different from the default subject.
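
For a quick illustration of the expansion (using the Addressable gem, which the set_urls code shown below also relies on):

require 'addressable/template'
require 'rdf'

# Expand "#location-{GID}" with the first row's GID value, then resolve it
# against the table URL to get the subject for that cell.
template = Addressable::Template.new("#location-{GID}")
fragment = template.expand("GID" => "1").to_s
#=> "#location-1"
RDF::URI("http://w3c.github.io/csvw/examples/tree-ops.csv").join(fragment)
#=> RDF::URI for http://w3c.github.io/csvw/examples/tree-ops.csv#location-1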

The last three columns are virtual columns, as they don’t correspond to data actually in the CSV; these are used for injecting information into the row. When fully processed, the following RDF is created (again, in minimal mode):

@base <http://w3c.github.io/csvw/examples/tree-ops.csv> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<#event-2010-06-02> schema:startDate "2010-06-02"^^xsd:date .

<#event-2010-10-18> schema:startDate "2010-10-18"^^xsd:date .

<#gid-1> schema:event <#event-2010-10-18>;
   schema:location <#location-1>;
   schema:name "Celtis australis";
   schema:url <#gid-1>;
   <#trim_cycle> "Large Tree Routine Prune" .

<#gid-2> schema:event <#event-2010-06-02>;
   schema:location <#location-2>;
   schema:name "Liquidambar styraciflua";
   schema:url <#gid-2>;
   <#trim_cycle> "Large Tree Routine Prune" .

<#location-1> a schema:PostalAddress;
   schema:streetAddress "ADDISON AV" .

<#location-2> a schema:PostalAddress;
   schema:streetAddress "EMERSON ST" .

Processing this requires some enhancement to the RDF::Tabular::Row#initialize method:

# Wraps each resulting row
class Row
  attr_reader :values

  # Class for returning values
  Cell = Struct.new(:table, :column, :row, :stringValue, :aboutUrl, :propertyUrl, :valueUrl, :value, :errors) do

    def set_urls(mapped_values)
      %w(aboutUrl propertyUrl valueUrl).each do |prop|
        # If the cell value is nil, and it is not a virtual column
        next if prop == "valueUrl" && value.nil? && !column.virtual
        if v = column.send(prop.to_sym)
          t = Addressable::Template.new(v)
          mapped = t.expand(mapped_values).to_s
          url = row.context.expand_iri(mapped, documentRelative: true)
          self.send("#{prop}=".to_sym, url)
        end
      end
    end

  end

  def initialize(row, metadata, number, source_number)
    @values = []
    map_values = {}   # cell values, by column name, used for URI template expansion
    skipColumns = metadata.dialect.skipColumns.to_i
    columns = metadata.tableSchema.columns
    row.each_with_index do |value, index|
      next if index < skipColumns
      column = columns[index - skipColumns]
      @values << cell = Cell.new(metadata, column, self, value)

      # Match the string value against the column's datatype; expanded_dt is the
      # expanded datatype IRI, computed elsewhere
      datatype = column.datatype || Datatype.new(base: "string", parent: column)
      cell.value = value_matching_datatype(value.dup, datatype, expanded_dt, column.lang)
      map_values[column.name] = cell.value.to_s
    end

    # Map URLs for row
    ...
  end
end

This introduces datatype matching through #value_matching_datatype (not detailed), creates a map_values structure that can be used for URI template processing, and then calls Cell#set_urls to actually create aboutUrl, propertyUrl, and valueUrl annotations on the cell.

The #each_statement method is updated to take these into consideration:

def each_statement(&block)
  metadata.each_row(input) do |row|
    default_cell_subject = RDF::Node.new
    row.values.each_with_index do |cell, index|
      cell_subject = cell.aboutUrl || default_cell_subject
      propertyUrl = cell.propertyUrl || RDF::URI("#{metadata.url}##{cell.column.name}")
      yield RDF::Statement(cell_subject, propertyUrl, cell.valueUrl || cell.value)
    end
  end
end

Here the only difference is that we use cell.aboutUrl, cell.propertyUrl, and cell.valueUrl if they are defined, and the defaults otherwise.

Multiple Values

Microsyntaxes are common in CSVs, and there are many different kinds. A microsyntax is some convention for formatting information within a cell. CSVW supports delimited values within a cell, so that a list of elements can be provided, allowing a single cell to contain multiple values.

For example, the tree-ops-ext.csv example allows for multiple comments on a record, using “;” as a separator:

GID,On Street,Species,Trim Cycle,Diameter at Breast Ht,Inventory Date,Comments,Protected,KML
1,ADDISON AV,Celtis australis,Large Tree Routine Prune,11,10/18/2010,,,"<Point><coordinates>-122.156485,37.440963</coordinates></Point>"
2,EMERSON ST,Liquidambar styraciflua,Large Tree Routine Prune,11,6/2/2010,,,"<Point><coordinates>-122.156749,37.440958</coordinates></Point>"
6,ADDISON AV,Robinia pseudoacacia,Large Tree Routine Prune,29,6/1/2010,cavity or decay; trunk decay; codominant leaders; included bark; large leader or limb decay; previous failure root damage; root decay;  beware of BEES,YES,"<Point><coordinates>-122.156299,37.441151</coordinates></Point>"

The metadata file adds the separator property to the comments column:

{
  "tableSchema": {
    "columns": [
      ...
      {
        "name": "comments",
        "titles": "Comments",
        "dc:description": "Supplementary comments relating to the operation or tree.",
        "datatype": "string",
        "separator": ";"
      }
    ]
  }
}

The effect of separator is to split the string value into multiple values and parse them using any datatype description. This is shown using slight modifications to RDF::Tabular::Row#initialize and RDF::Tabular::Reader#each_statement:

# Wraps each resulting row
class Row
  attr_reader :values
  def initialize(row, metadata, number, source_number)
    @values = []
    map_values = {}   # cell values, by column name, used for URI template expansion
    skipColumns = metadata.dialect.skipColumns.to_i
    columns = metadata.tableSchema.columns
    row.each_with_index do |value, index|
      next if index < skipColumns
      column = columns[index - skipColumns]
      @values << cell = Cell.new(metadata, column, self, value)
      datatype = column.datatype || Datatype.new(base: "string", parent: column)

      # Split the string value on the column separator, if one is defined
      cell_values = column.separator ? value.split(column.separator) : [value]

      # Match each string value against the column's datatype
      cell_values = cell_values.map do |v|
        value_matching_datatype(v.dup, datatype, expanded_dt, column.lang)
      end

      cell.value = (column.separator ? cell_values : cell_values.first)

      map_values[column.name] = (column.separator ? cell_values.map(&:to_s) : cell_values.first.to_s)
    end

    # Map URLs for row
    @values.each do |cell|
      mapped_values = map_values.merge(
        "_name" => URI.decode(cell.column.name),
        "_column" => cell.column.number,
        "_sourceColumn" => cell.column.sourceNumber
      )
      cell.set_urls(mapped_values)
    end
  end
end

The only change to #each_statement is to consider multiple cell values:

def each_statement(&block)
  metadata.each_row(input) do |row|
    default_cell_subject = RDF::Node.new
    row.values.each_with_index do |cell, index|
      cell_subject = cell.aboutUrl || default_cell_subject
      propertyUrl = cell.propertyUrl || RDF::URI("#{metadata.url}##{cell.column.name}")

      if cell.valueUrl
        yield RDF::Statement(cell_subject, propertyUrl, cell.valueUrl)
      else
        Array(cell.value).each do |v|
          yield RDF::Statement(cell_subject, propertyUrl, v)
        end
      end
    end
  end
end

Common Properties and Notes

Common Properties and Notes are simply embedded JSON-LD within the metadata file (although limited to a specific dialect to simplify implementations). In Ruby, of course, we have a full json-ld gem available, so turning this into RDF shouldn’t present any problems.
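
For example, a sketch of handing such a fragment to the json-ld gem (this is not the rdf-tabular code itself, just an illustration of the approach):

require 'json/ld'

fragment = {
  "@context" => "http://www.w3.org/ns/csvw",
  "@id"      => "http://w3c.github.io/csvw/tests/tree-ops.csv",
  "dc:title" => {"@value" => "Tree Operations", "@language" => "en"}
}

# Yield each RDF statement generated from the embedded JSON-LD
JSON::LD::API.toRdf(fragment) do |statement|
  puts statement.inspect
end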

Conclusion

This article has shown the basis for implementing the CSV on the Web specifications, and hopefully will aid in more implementations.

Update 2014-05-14

This post has been updated to reflect the fact that metadata merging has been removed from CSVW, simplifying an implementation even further.

CSV on the Web

As many who follow me know, I’ve long been involved in the Linked Data and Semantic Web movements. It’s been my privilege to work on several successful projects, notably RDFa, Microdata RDF, and JSON-LD. For the past several months I’ve actively been working with the W3C CSV on the Web Working Group to help define a mechanism for transforming tabular data into RDF and JSON.

Much of the data published on the Web is not HTML, or even some other structured content. In fact, a substantial amount of information is published as tabular data (comma-separated-values or tab-separated-values, commonly CSVs). This includes a vast amount of open government publications, such as records, weather and climate information, infrastructure assets, and so forth (see Use Cases for more detail). The charter of the CSV on the Web Working Group is to provide technologies to promote greater interoperability of these datasets.

Real data is often messy, and information contained in relational databases and large spreadsheets is often lost when it’s published as CSVs:

  • Foreign Key relationships between tables
  • Primary Keys of tables (including composite primary keys)
  • Well Known Identifiers of entities used in tables
  • Datatypes of column data, such as date/time or numeric values

The aim of the working group is to create a common Model for Tabular Data, and Metadata Format for describing tabular data. Given this, validators, converters and viewers are enabled to work with datasets by using a common data model.

CSV-LD

My work started about a year ago when the Working Group was being established. The group was considering using JSON-LD as a way of describing metadata about tables, and the work we had done on JSON-LD Framing seemed like a good fit for both a metadata and a transformation specification. My thoughts are expressed in a wiki page on CSV-LD; the idea was that each row of a table could map to values within a JSON-LD template, which could then be turned into a JSON-LD document for each row of the spreadsheet.

Looking back, there are a number of over-simplifications in this spec, but the use of JSON-LD to describe information about tabular data remains.

CSV Data Model

The Model for Tabular Data describes an abstract data model for groups of tables, tables, columns, rows and individual cells. As such, it becomes a useful mechanism to refer to information when performing validation or conversion of CSVs. In practice, a complete in-memory data model doesn’t typically need to be created, but it serves a useful purpose: Each entity in this model describes its relationship with other entities using core annotations; for example, the set of tables in a group of tables, the rows and columns in a table, and the cells within both rows and columns. The model also describes other useful information, such as the datatypes of a cell value, and if the cell contains multiple values (using an internal separator).

A converter, such as a CSV2RDF or CSV2JSON converter, can then logically iterate over these entities to generate the resulting information. A validator can use information about datatypes to validate cell values and foreign key relationships to verify the integrity of published information. A viewer can use information about character sets, table direction and text direction to properly present information originally published as CSVs.

Creating a data model from CSVs is the process of either starting with metadata describing those CSVs, or locating metadata given a CSV. The Model for Tabular Data describes the process of locating such metadata and reconciling multiple metadata files that may be encountered. The Metadata Format then describes the structure of this data (syntactically, as a JSON-LD formatted document) and how to work with it to create the basis of the core annotations of the annotated data model.

As an example, I recently worked with others to release the rdf-vocab Ruby gem, which includes Ruby versions of different RDF ontologies/vocabularies used for reasoning on things such as the Structured Data Linter. One vocabulary we very much wanted to include was the IANA Link Relations. This is not published as any kind of RDF dataset, much less as an RDFS or OWL vocabulary definition. In fact, this has prevented it from being used in different standards work, as the link relation for describes cannot easily be dereferenced to find its definition. Formally, the describes link relation maps to the http://www.iana.org/assignments/link-relations/describes URI, but dereferencing this results in an HTTP 404 status. However, this doesn’t need to get in the way of defining an RDFS vocabulary based on this dataset. Thankfully, the link relations are published in other formats, including CSV. This allows us to construct a metadata description of this dataset:

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "http://www.iana.org/assignments/link-relations/link-relations-1.csv",
  "tableSchema": {
    "aboutUrl": "{name}",
    "lang": "en",
    "columns": [
      {"name": "name", "title": "Relation Name", "propertyUrl": "rdfs:label"},
      {"name": "comment", "title": "Description", "propertyUrl": "rdfs:comment"},
      {"name": "type", "title": "Reference", "propertyUrl": "rdf:type", "valueUrl": "rdf:Property"},
      {"title": "Notes", "suppressOutput": true}
    ]
  }
}

This says that there is a table located at http://www.iana.org/assignments/link-relations/link-relations-1.csv with a schema. Each row identifies a different subject using the aboutUrl property. The "{name}" value is a URI template that creates a URI from the name variable relative to the location of the CSV.

Each column is given a name and title, along with a propertyUrl which is used to relate the subject with the value of a given cell. The first two columns are textual, and result in an RDF Literal when mapped to RDF. The lang property defines the overall language to apply to plain literals in this dataset. The third column is turned into a URI using the valueUrl property and the fourth column is simply ignored; note that the cell contents of the third column aren’t actually used to create the value, and the value is hard-coded as rdf:Property.

Mapping this on the CSV, for example, turns the first row into the following RDF:

<http://www.iana.org/assignments/link-relations/about> rdfs:label "about"@en;
  rdfs:comment "Refers to a resource that is the subject of the link's context."@en;
  rdf:type rdf:Property .

Within the RDF.rb tool chain, this allows the use of RDF::Vocab::IANA.about to refer to this property definition. The reasoner knows it’s a property, because its type is rdf:Property. The completely transformed CSV can be found here.
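
A hypothetical usage sketch (assuming the metadata above is saved locally as link-relations.json) for generating those triples with the rdf-tabular reader:

require 'rdf/tabular'
require 'rdf/turtle'

# Load the metadata description; the reader fetches the referenced CSV and
# yields the resulting triples, which we then serialize as Turtle.
graph = RDF::Graph.load("link-relations.json", format: :tabular)
puts graph.dump(:ttl, standard_prefixes: true)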

The group has just released new working drafts for Model for Tabular Data and the Metadata Format, along with CSV2RDF and CSV2JSON.

An implementation in Ruby

My method for working on specifications is to develop code implementing these specifications as they are being developed. This, of necessity, means that there is a lot of work (and re-work) as the details are being worked out, but I view it as a necessary part of spec development to ensure that the result can be implemented. In this case, this is embodied in the rdf-tabular gem, which implements the Model for Tabular Data and Metadata Format as well as both CSV2RDF and CSV2JSON and much of the validation required for these to work. As is true for all RDF.rb gems, this is available in the public domain (Unlicense) and is freely usable by anyone wishing to get a start on their own implementation.

You can try this out on my RDF Distiller, or using the Structured Data Linter.

Test Suite

With updated working drafts, the next phase of the CSVW Working Group will be to develop more tests. There is already a substantial Test Suite available which probes various corner cases of validating and converting tabular data. Over 80 tests are already defined, and more will be forthcoming in the coming weeks and months. (Disclaimer: these tests have yet to be accepted by the WG.)

A forthcoming blog entry will delve into the basis of a simple implementation, and what it takes to completely conform to the specification.

Release 1.1.0 of Ruby RDF libraries

I’m happy to announce release 1.1.0 of the Ruby RDF suite, which includes support for RDF 1.1 Concepts and syntaxes, including RDF Literal changes, N-Triples, N-Quads, TriG and JSON-LD. Additionally, the SPARQL gem has been updated to support RDF 1.1 differences (mainly the differences between plain literals and those with xsd:string datatypes, and the fact that language-tagged literals now have the rdf:langString datatype).

There are some incompatibilities with previous releases described below:

  • Support for Ruby 1.8.* has been eliminated from all gems; however, the latest versions of JRuby and Rubinius are supported.

  • RDF.rb: https://github.com/ruby-rdf/rdf

    • Works on versions of Ruby >= 1.9.2 (1.8.* support now dropped); this includes JRuby and Rubinius 2.1+ (true for all gems)
    • Support for all RDF 1.1 concepts (other than skolemization; if anyone’s really interested in this, we could add support in a future release),
    • Native implementation of RDF::URI (aka RDF::IRI) without using Addressable::URI; this is actually a big performance win, as URIs typically don’t need to be parsed now,
    • RDF::Graph cannot take a context (name) unless it is a projection of a named graph from an RDF::Repository (aka RDF::Dataset),
    • Support for 1.1 versions of N-Triples and N-Quads
    • Support for changes to Literal semantics, meaning that all literals have a datatype. #plain? and #simple? accessors still exist, which have expected results.
    • RDF::Query#execute now accepts a block and returns RDF::Query::Solutions. This allows enumerable.query(query) to behave like query.execute(enumerable) and either return an enumerable or yield each solution (see the sketch after this list).
    • RDF::Queryable#query returns a concrete Enumerator extending RDF::Queryable and RDF::Enumerable, or RDF::Query::Solutions; this improves performance when accessing enumerator methods, as they’re not extended dynamically (see issue #123: “Performance and Kernel#extend”)
    • RDF::Util::File.open_file (used for Mutable.load, among others) now honors 303 and Location: headers to maintain the source identifier, and tries to manage all input data as UTF.
    • RDF::StrictVocabulary is now added as a sub-class of RDF::Vocabulary; most built-in vocabularies are now strict, meaning that an attempt to access a term not defined in the vocabulary will raise an error.
    • New vocabulary definitions have been added for iCal, Media Annotations (MA), Facebook OpenGraph (OG), PROV, SKOS-XL (SKOSXL), Data Vocabulary (V), VCard, VOID, POWDER-S (WDRS), XHV, and schema.org.
  • RDF::N3: https://github.com/ruby-rdf/rdf-n3

    • No longer detects formats when a specific content type is not known; this prevents problems interpreting Turtle as N3.
  • RDF::RDFXML: https://github.com/ruby-rdf/rdf-rdfxml

    • Now fully supports JRuby; previously, the writer only worked on MRI and Rubinius. Writer now uses Haml templates, and extends the RDFa writer.
  • RDF::Turtle: https://github.com/ruby-rdf/rdf-turtle, RDF::TriG: https://github.com/ruby-rdf/rdf-trig

    • Full support for RDF 1.1 specifications
  • SPARQL: https://github.com/ruby-rdf/sparql

    • All query operators now accept a block for returning solutions in addition to returning the RDF::Query::Solutions.
    • Specific Queryable implementations can now optimize any part of the SPARQL Algebra execution chain by implementing the appropriate operator as all calls go through the RDF::Queryable object. This can be used by specific storage adaptors to implement part or all of the query chain using implementation-specific optimizations.
    • Minor test harness changes to tolerate RDF 1.1 changes.
  • Rack::LinkedData, Sinatra::LinkedData, Rack::SPARQL, Sinatra::SPARQL

    • All updated for latest versions of Rack and Sinatra. Note that Sinatra does not support returning the enumerable without a content type for content-negotiated serialization of the results, as Sinatra defaults results to text/html. Use linkeddata_options (or sparql_options) to set desired content type or format of result. Prioritized ACCEPT types provided through environment.
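
As an example of the RDF::Query#execute change mentioned above, both calling styles are now available (a small sketch, not taken from the release notes):

require 'rdf'

graph = RDF::Graph.new
graph << [RDF::Node.new, RDF::URI("http://xmlns.com/foaf/0.1/name"), "Alice"]

query = RDF::Query.new do
  pattern [:person, RDF::URI("http://xmlns.com/foaf/0.1/name"), :name]
end

# Returns RDF::Query::Solutions...
query.execute(graph).each { |solution| puts solution[:name] }

# ...or yields each solution to a block; graph.query(query) behaves the same way.
query.execute(graph) { |solution| puts solution[:name] }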

All gems now support only Ruby >= 1.9.2, JRuby and Rubinius.

This includes the following individual gems:

  • Core RDF and repositories:

    • rdf
    • rdf-aggregate-repo
    • rdf-do
    • rdf-isomorphic
    • rdf-mongo
    • rdf-spec
  • RDF Serialization formats

    • json-ld
    • rdf-json
    • rdf-microdata
    • rdf-n3
    • rdf-raptor
    • rdf-rdfa
    • rdf-rdfxml
    • rdf-trig
    • rdf-trix
    • rdf-turtle
    • rdf-xsd
  • Querying

    • sparql
    • sparql-client
    • spira
  • Rollup releases, including most of the above gems

    • linkeddata
    • rack-linkeddata
    • sinatra-linkeddata

All versions of this are running on my distiller: http://rdf.greggkellogg.net/. Any comments or issues found prior to release are appreciated.

I’d like to thank David Butler, Judson Lester, Justin Coyne, Slava Kravchenko, and Aymeric Brisse in particular for their active support in maintaining the Ruby RDF eco-system.

RDF.rb and SPARQL gem updates

RDF.rb and SPARQL updates
Version 1.0.6 of RDF.rb and 1.0.7 of SPARQL gems released.

I recently pushed updates to RDF.rb and SPARQL gems. These updates contain some useful new features:

RDF.rb

The main RDF gem has many updates to bring it closer to the coming update to the RDF 1.1 specifications (RDF Concepts, RDF Semantics, Turtle, TriG, N-Triples, N-Quads, and JSON-LD). Notable changes since the 1.0 release are:

  • Make a distinction between Plain and Simple literals; Simple literals have no language, while Plain literals may also have a language. In preparation for RDF 1.1, literals having a datatype of xsd:string are considered Simple literals, and language-tagged literals may have a datatype of rdf:langString.
  • Improved support for queries using hash patterns (thanks to @markborkum).
  • Update N-Triples (and N-Quads) readers and writers to support the RDF 1.1 version, including new escape sequences and support for UTF-8 in addition to ASCII. When writing, characters are escaped depending on the specified encoding or that inferred from the output file.
  • Term and Statement comparison is dramatically improved, improving statement insertion and query performance.
  • Other literal changes required to support SPARQL 1.1

On the 1.1 branch:

  • RDF::URI re-written to not require Addressable. This improves general performance of using URIs by about 50% (depends on 1.9+ features, so not included in the 1.0 branch).
  • Support for Ruby versions less than 1.9.2 is dropped.

SPARQL

The SPARQL gem is now based on the SPARQL 1.1 grammar, and includes some features from SPARQL 1.1, including all functions, builtins, and variable bindings. Look for new features to be added incrementally; once a critical mass is reached, I’ll update the gem version to 1.1 to reflect that this is essentially a 1.1-compatible version of SPARQL. Eventually this will include SPARQL Update and Federated Queries as well.

Other gems

  • JSON::LD is released as version 1.0.0 and is fully compatible with the last-call version of the JSON-LD specifications, including support for framing.
  • RDF::Turtle is fully compatible with the RDF 1.1 version.
    • Also includes a Freebase-specific reader for faster performance when reading Freebase dumps.
  • RDF::RDFa is compatible with RDFa Core 1.1 and HTML+RDFa 1.1
  • RDF::TriG is released as a 1.0.0 version, based on the RDF 1.1 note
  • RDF::Raptor now uses Raptor 2.0, and is fully compatible with the latest version of Redland/Raptor.

RDF.rb 1.0 Release

I’m happy to announce the 1.0 release of RDF.rb and related Ruby gems. This release has been a long time coming, and the library has actually been quite stable for some time.

RDF.rb is a Ruby Gem implementing core RDF concepts, such as Graph, Repository, Statement, and Query. It also supports core readers and writers (parsers and serializers) for N-Triples and N-Quads.

Through other gems, more readers and writers are implemented, including Turtle, TriG, RDF/XML, RDFa, microdata, and JSON-LD.

Additional readers and writers are available through Redland Raptor bindings using the RDF::Raptor gem.

In addition to native support for BGP-based queries, there is a fully-conformant SPARQL 1.0 gem (SPARQL gem), and a SPARQL 1.1 client gem (SPARQL::Client).

All of these gems can be packaged together using the Linkeddata, Rack::LinkedData, and Sinatra::LinkedData gems.

There are also a number of storage adaptors for popular backends.

Background

I first became involved with RDF when working on the Connected Media Experience (CME) design (see blog entry). Having designed a proprietary metadata standard for Gracenote and Warner Music Group (later for CME), through review with Lucas Gonze I was introduced to the Music Ontology, which had done an extremely thorough modeling of the music domain. This caused significant debate in the CME community, which led to an updated design based on RDF and many of the ideas from the Music Ontology.

During the same review, I was also introduced to RDFa as a mechanism of embedding music metadata within a web page. Since CME is endeavoring to create a standard for enhanced digital media packages, HTML5 and RDFa are natural technologies to utilize.

My methodology for approaching architectures and specifications is based on parallel prototyping, to validate the details of a design, and sometimes to serve as the basis for an implementation. For several years (starting in 2007) I had been using Ruby on Rails and was quite invested in Ruby. My first attempts to integrate RDFa into the implementation made use of the Raptor parser, which unfortunately was not up to date with respect to the RDFa 1.0 specification current at the time. Additionally, I found that the Ruby bindings suffered from memory leaks, and went to look for a native Ruby implementation. This led to Tom Morris’ Reddy gem, a port of the Python rdflib package to Ruby, which was going in the right direction but had fallen into disuse. I created my own fork, later released as RdfContext, which had complete implementations for RDF/XML, RDFa 1.0, and Notation3 (N3-rdf level).

In the meantime, Arto Bendiken and Ben Lavender had been working on RDF.rb, taking a different approach closer to Sesame. The design of RDF.rb is quite elegant, making effective use of natural Ruby idioms and taking an approach based heavily on Ruby module extensions. After a prod by Nick Humfrey, who had started a port of my RDF/XML parser, I jumped in and ported the bulk of my RDF parsers and serializers, eventually adding several more, to make the RDF.rb platform one of the most complete in terms of standards support across all major (and minor) serialization formats. In 2010, I was asked to join the RDF.rb core development team, and soon became the primary maintainer after Arto and Ben became fully committed to Dydra.

Arto and I finally got together recently to move all of the gems’ primary repositories to the Ruby RDF organization on GitHub, and release them as 1.0. The next significant release should be 1.1, to coincide with the release of the RDF 1.1 specs from the W3C.

JSON-LD and MongoDB

For the last several months, I’ve been engaged in an interesting project with Wikia. Wikia hosts hundreds of thousands of special-interest wikis for things as varied as pokemon, best cellphone rate comparisons, TV shows and Video Games.

For those of you not aware of Wikia, it is an outgrowth of MediaWiki and was founded by Jimmy Wales as a for-profit means of using the MediaWiki platform for exactly such interests.

Recently, Wikimedia Deutschland started work on WikiData, an effort to use Semantic Web principles to create a factual knowledge base that can be used within wikis (typically to replace Infobox information, which can vary between different language versions). This is a somewhat different direction than Semantic MediaWiki, which is more about using wiki markup to express semantic relationships within a wiki. As it happens, JSON-LD is being considered as the data representation model for WikiData.

Linked Data at Wikia

As it turns out, Wikia has been quite interested in leveraging these tools. I did mention that Wikia is a for-profit company; one way they do this is through in-page advertising, but the amount of knowledge curated by the hundreds of thousands of communities is staggering. Unfortunately, native Wiki markup just isn’t that semantic. However, much of the information represented is factual (at least within the world-view of the wiki community).

To that end, I’ve been working on an experiment using JSON-LD and MongoDB to power a parallel structured data representation of much of the information contained in a wiki. The idea is to add a minimal amount of markup (hopefully) to the Wiki text and templates so that information can be represented in the generated HTML using RDFa. This allows the content of the Wiki to be mirrored in a MongoDB-based service using JSON-LD. Once the data has been freed from the context of the limited Wiki markup, it can now be re-purposed outside of the Wiki itself.

Knowledge modeling and data representation

Why use RDFa and not Microdata? The primary driver is the need to use multiple vocabularies to represent information. In my opinion, any new vocabulary needs to take into consideration schema.org; microdata works great with schema.org, and can generate RDF (see Microdata to RDF) as long as you’re constrained to a single vocabulary, don’t need to keep typed data, and don’t need to capture actual HTML markup. Unfortunately, any serious application beyond simple Search Engine Optimization (SEO) does need to use these features. In our case, much of the interesting data to capture are fragments of the Wiki pages themselves. Moreover, the content of any Wiki, much less one that has as much special meaning as, say, a Video Game, needs to describe relationships that are not natively part of the schema.org vocabulary. Schema does provide an extension mechanism partly for this purpose, and recently the ability to tag subjects with an additional type, not part of the primary vocabulary (presumably schema.org) was introduced. But, once the decision is made to use multiple vocabularies, RDFa has better mechanisms in place anyway.

At Wikia, we define a vocabulary as an extension to schema.org, that is, the classes defined within that vocabulary are sub-classes of schema.org classes, although typically the properties are not sub-properties of schema.org properties (we may revisit this). For example, a wikia:VideoGame is a sub-class of schema:CreativeWork, and a wikia:WikiText is a sub-class of schema:WebPageElement. There are additional class and property definitions to describe the structural detail common to Video Games in describing characters, levels, weapons, and so forth. An RDFa description will assert both the native class (e.g., wikia:VideoGame) and the schema.org extension class (e.g. schema:CreativeWork/VideoGame). This allows search engines to make sense of the structured data, without the need to understand an externally defined vocabulary.

However, for Wikia’s purposes, and that of people wanting to work within the Wikia structured-data ecosystem, having a vocabulary that models the information contained within Wikia wikis can be of great benefit. Key to this is knowing how much to model with classes and properties, and how much to leave to things such as naming conventions and keywords. In fact, there are likely cases where more per-wiki modeling is required, and we are continuing to explore ways in which we can further extend the vocabularies without imposing a large burden on ontology development, while keeping the data reasonably generically useful.

Linked Data API

Although RDFa structured in HTML can be quite useful as an API itself, modern Single Page Applications are better served through RESTful interfaces with a JSON representation. JSON-LD was developed as a means of expressing Linked Data in JSON. It is fully compatible with RDF. Indeed, many of the concepts used in RDFa can be seen in JSON-LD – Compact IRIs, language- and datatyped-values, globally identified properties, and the basic graph data model of RDF.

Furthermore, a JSON-LD-based service allows resource descriptions, which may be spread across multiple HTML pages, to be consolidated into individual subject definitions. By storing these subject definitions in a JSON-friendly datastore such as MongoDB, the full power of a scalable document store becomes available to data otherwise spread out across numerous wiki pages. And the fact that the JSON-LD can be fully generated from the RDFa contained in the generated wiki pages ensures that the data will remain synchronized.
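
As a rough sketch of this idea (hypothetical names and sample data; not Wikia’s actual service), a consolidated JSON-LD subject definition can simply be upserted into a MongoDB collection keyed on its @id:

require 'mongo'

# A consolidated JSON-LD subject definition, as might be distilled from a page's RDFa
subject = {
  '@id'         => 'http://example.wikia.com/wiki/Example_Game#subject', # hypothetical IRI
  '@type'       => ['wikia:VideoGame', 'schema:CreativeWork'],
  'schema:name' => 'Example Game'
}

client = Mongo::Client.new(['127.0.0.1:27017'], database: 'wikia_linked_data')
# Upsert keyed on the subject IRI so that re-processing a page updates the same document
client[:subjects].replace_one({'@id' => subject['@id']}, subject, upsert: true)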

In the future, with the growth and adoption of systems such as WikiData, we can expect to see much of the factual information currently expressed as Wiki markup moved to outside services. The needs of the Wiki communities remain paramount, as they are at the heart of the data explosion we’ve seen in the hundreds of thousands of Wikis hosted at Wikia and elsewhere, not to mention WikiPedia and related MediaWiki projects.

As the communities become more comfortable with using knowledge stores, such as WikiData and Wikia’s linked data platform, we should see a further explosion in the amount of structured information available on the web in general. The real future, then, relies not only in the efforts of communities to curate their information, but in the ability to use the principles of the Semantic Web and Linked Data to infer connections based on distributed information.

I’ll be speaking more about JSON-LD and MongoDB at NoSQL Now! later this week in San Jose. Slides for my talk are available on slideshare.

Channel Islands May 2012

Just back from a great trip to Santa Cruz Island over Memorial Day. Did some teaching, and had just about two dives to do some underwater macro photography. I recently purchased a Canon 100 mm USM, which I got to check out. Water was pretty murky, but being able to get just inches away from the subjects does wonders. While shooting one of the nudibranchs, a Harbor Seal started tugging on my fins and came right around in front; that’s when I wished I had had a wide-angle lens too. Check out the photo gallery.

Hermissenda crassicornis

BrowserID versus DDOS

BrowserID vs DDOS
How BrowserID saved the RDFa Test Suite from a DDOS (Distributed Denial of Service Attack).

This article is the third in a three-part series on implementing the RDFa Test Suite. The first article discussed the use of Sinatra, Backbone.js and Bootstrap.js in creating the test harness. The second article discussed the use of JSON-LD. In this article, we focus on our use of BrowserID in responding to a Distributed Denial of Service attack (DDOS).

RDFa Test Suite

Working on the updated RDFa Test Suite has really been a lot of fun. It was a great opportunity to explore new Web application technologies, such as Bootstrap.js and Backbone.js. The test suite is a single-page web application which uses a Sinatra-based service to run individual test cases.

The site was becoming stable, and we were starting to flesh out more test cases for odd corner cases, when the site started to not respond. Manu Sporny, whose company Digital Bazaar is kindly donating hosting for the web site, noticed that there were a number of Ruby processes consuming the available Ruby workers and causing new requests to block. The service is fairly resource intensive, as it must invoke an external processor and run a SPARQL query over the results for each test. It seemed as if the site was being hammered by a large number of overzealous search crawlers! Naturally, we put a robots.txt in place, expecting that conforming search engines would detect the site’s crawl preferences and back off, but that didn’t happen. Upon further examination of the server logs, we noted requests were streaming in from all over the world! Clearly, we were under attack. (Who might wish ill of the RDFa development effort? Who knows, but most likely this was just an anonymous, and not specifically malicious, attack.)

My first thought was to make use of a secret API token, configured into the server and the web app, but that didn’t really do the trick either; it seemed that modern-day malware actually just executes the JavaScript, so it picks up the API key naturally!

BrowserID to the Rescue!

Okay, how about authentication? It’s typically a pain, and we were reluctant to put up barriers in front of people who might want to test their own processors or see how listed processors perform. The two current contenders are WebID and BrowserID.

WebID has the laudable goal of combining personally maintained profile information with SSL certificates (it was previously known as FOAF+SSL). Basically, it’s a mechanism to allow users to use a profile page as their identity. This could come off of their blog, Facebook, Twitter or other social networking site. By configuring an SSL certificate into the browser and pointing to their profile page, a service can determine that the profile page actually belongs to the user. (There’s much more to it; you can read more in the WebID Spec.) A key advantage here is that the service now has access to all of the self-asserted information the user wants to provide about themselves as defined in their profile page, such as foaf:name, foaf:knows, and so forth. The chief downside is that the common sources of existing user identities in the world haven’t bought into this, and there’s a competing solution that offers similar benefits.

BrowserID is a Mozilla initiative to enable people with e-mail addresses to use those e-mails to log in to websites, kind of like OpenID – only more secure. Basically, as I understand it, a service wanting to support this would include the BrowserID JavaScript client code in their application and use a simple Sign In button that invokes this code. That sends a request off to the identity provider (IDP) to authenticate the user, which has probably already happened in the past and been maintained in a cookie. The IDP then sends a response which invokes a callback. The client then does a call back to the service to complete the login, passing the assertion.

The beauty is, using a tool such as the sinatra-browserid Ruby gem, this becomes dirt simple! Basically, on the API side, put in a call to authorized? to determine if the user is authorized. If not, either direct them to a login screen, or, in the case of the RDFa Test Suite, place an informational message telling them why we need them to login, and identify the BrowserID button at the top of the page.

The principal entry point to the test suite on the service side is /test-suite/check-test/:version/:suite/:num. The only real change to this method was to check for authorization before performing the test.

# Run a test
get '/test-suite/check-test/:version/:suite/:num' do |version, suite, num|
  return [403, "Unauthorized access is not allowed"] unless authorized?

  # Get the SPARQL query used to check this test's results
  source = File.open(File.expand_path("../tests/#{num}.sparql"))

  # Do host-language specific modifications of the SPARQL query.
  query = SPARQL.parse(source)

  # Invoke the processor and retrieve results, parsed into an RDF graph
  # (test_path and format are helpers defined elsewhere in the application)
  graph = RDF::Graph.load(params['rdfa-extractor'] + test_path(version, suite, num, format))

  # Run the query
  result = query.execute(graph)

  # Return results as JSON
  {:status => result}.to_json
end

In the banner, we add a little bit of Haml:

...
%div.navbar-text.pull-right
  - if email
    %p.email
      Logged in as
      %span.email
        = email
      %a{:href => '/test-suite/logout'}
        (logout)
  - else
    = render_login_button

When the page is returned, the email variable is set if the user is authorized, so they’ll see their email address if they’ve authenticated, and a login button otherwise. The render_login_button is handled entirely by sinatra-browserid; no muss, no fuss!

The only other thing to do is to not show the test cases in the test suite unless the user has authenticated, which we can tell because $("span.email") won’t be empty. In our application.js, we use this to either show the tests, or an explanation:

// If logged in, create primary test collection and view
if ($("span.email").length > 0) {
  this.testList = new TestCollection([], {version: this.version});
  this.testList.fetch();
  this.testListView = new TestListView({model: this.testList});
} else {
  this.unauthorizedView = new UnauthorizedView();
}

That’s pretty much all there is to it. The only complication I faced is that, when developing with shotgun, the session ID is changed with each invocation, so it wasn’t remembering the login. By fixing the session secret this problem went away. Total time from discovery of the problem to deployed solution: about 1 hour. Not too bad.
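
For reference, pinning the session secret in Sinatra is a one-liner (a sketch; the environment variable name here is just an example, not the test suite’s actual configuration):

require 'sinatra'

enable :sessions
# Use a fixed secret so that restarts (e.g. under shotgun) don't invalidate existing sessions
set :session_secret, ENV.fetch('SESSION_SECRET') { 'fixed-development-secret' }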

It’s important to note that the RDFa Test Suite is stateless, and we don’t really need any personal information; we don’t collect information anywhere, even in our logs. BrowserID basically becomes a gatekeeper to help ward off abuse. It imposes a very low barrier to entry, so it doesn’t interfere with people using the site any way they choose.

I do miss other user-asserted information, such as the user’s name and so forth. OpenID, another single sign-on initiative that has lost momentum lately, provides a Simple Registration Extension add-on that allows users to assert simple information such as nickname, mail, fullname and so forth. IMO, the right way to do this is with something like FOAF or the schema.org Person class. Perhaps BrowserID will provide something like this in the future.