Validating data with SHACL

SHACL is a new approach to validating RDF data and is currently a working draft at the World Wide Web Consortium (W3C). At the eInnsyn project, a collaboration between the Agency for Public Management and eGovernment (Difi) and the Oslo municipality (Oslo kommune), we have implemented a SHACL engine and are using SHACL for data validation. This gives us SPARQL-free data validation, which saves time, and constraints that are much easier to read and use. In this blog post, I will describe the main features of SHACL and how we convert existing XSDs to SHACL constraints.

What is SHACL?

Shape Constraint Language (SHACL) is a language for describing and constraining the content of RDF graphs. SHACL groups constraints into shapes, which specify conditions that apply to a given RDF node. Your input data therefore needs to fit a specific shape to be valid.

This blog post will not go into the specific details of SHACL, but rather give a simple overview and focus on two things: generating shape constraints from an existing XSD, and the validation report produced by a SHACL engine.

Using existing XSDs

When working with Noark 5, a Norwegian data format for archive data, we already had several well-described XSDs. Since we had decided to convert the Noark formats to RDF for our system and to try out SHACL for validation, we saw that our shape constraints could be generated from the information already available in the XSDs, rather than written manually. I will now describe our approach to the problem and explain the shape constraints in detail. Be aware that we are only using a subset of the core constraints of SHACL.

Parsing the XSDs

There are two Noark 5 XSDs which we needed information from, one describing the complex types including their elements, and one describing the expected restriction (data type) for each element. These restrictions may also be a list of enum values.

By using the Java document parser, we may easily get hold of the complex type names and elements, and then traverse down into the elements of the XSD.

However, even though this gives an easy-to-understand implementation, it is costly in time. We have to traverse the complex types first, then the sequences, and only then the elements. This requires at least three nested for-loops, which gives a worst-case time complexity of O(n^3). That said, the XSDs for this project are not very big, so the execution time of this solution is not a problem we need to deal with at this time.
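
The traversal described above can be sketched with the DOM parser that ships with the JDK. This is a minimal illustration, not our production code: the class and method names are made up for this post, and it assumes a flat schema where every element carries a name attribute and sequences are not nested.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only.
class XsdWalker {
    static final String XS = "http://www.w3.org/2001/XMLSchema";

    /** Maps each complex type name to the element names found in its sequences. */
    static Map<String, List<String>> elementsByComplexType(String xsd) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xsd.getBytes(StandardCharsets.UTF_8)));

            Map<String, List<String>> result = new LinkedHashMap<>();
            NodeList types = doc.getElementsByTagNameNS(XS, "complexType");
            for (int i = 0; i < types.getLength(); i++) {          // loop 1: complex types
                Element type = (Element) types.item(i);
                List<String> names = new ArrayList<>();
                NodeList sequences = type.getElementsByTagNameNS(XS, "sequence");
                for (int j = 0; j < sequences.getLength(); j++) {  // loop 2: sequences
                    Element sequence = (Element) sequences.item(j);
                    NodeList elements = sequence.getElementsByTagNameNS(XS, "element");
                    for (int k = 0; k < elements.getLength(); k++) { // loop 3: elements
                        names.add(((Element) elements.item(k)).getAttribute("name"));
                    }
                }
                result.put(type.getAttribute("name"), names);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XSD", e);
        }
    }
}
```

Note that getElementsByTagNameNS returns all descendants, so a schema with nested sequences would need a stricter child-only traversal than this sketch uses.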

Creating the shape constraints

We will add every shape constraint into a model using the Jena framework.

Model shapes = ModelFactory.createDefaultModel();

For each shape constraint, we need to know the name of the shape. This is equivalent to the complex type name. We may then create a resource which we will use as the subject for the properties in this shape.

Resource baseSubject = shapes.createResource(NS + complexTypeName + "Shape");

Here, NS is the namespace for this project, and "Shape" is the suffix of the shape name.

At the inner loop of the generation, we know the element name and the element type. Here we have enough knowledge from the XSD to create the shape constraints to build up our shape.

For each and every element, we need a blank node that will be the object of a sh:property-triple.

Resource property = shapes.createResource();

The following example adds the predicate name and the minimum occurrence of this predicate to the shape.

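// Attach the blank node to the shape and state which predicate it constrains.
shapes.add(baseSubject, SHACL.property, property);
shapes.add(property, SHACL.predicate, shapes.createResource(NS + elementName));
// Every minOccurs value other than "0" is treated as required (sh:minCount 1).
if (!elementElement.getAttribute("minOccurs").equals("0")) {
  shapes.add(
    property,
    SHACL.minCount,
    shapes.createTypedLiteral(1)
  );
}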

Using a similar approach for maxOccurs and type, we can build up shapes that our SHACL engine can process. The SHACL engine will then take these shape constraints and a data graph as input, and check whether the data fits the shapes. If it does not fit, a result graph is returned.
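
To make the mapping from occurrence attributes to count constraints concrete, here is a small sketch that does not use Jena and simply renders the property constraint as Turtle text. The class and method names are hypothetical. It relies on the XSD defaults (a missing minOccurs or maxOccurs means 1, and "unbounded" means no upper limit) and, unlike the snippet above, passes numeric minOccurs values through unchanged.

```java
import java.util.OptionalInt;

// Hypothetical helper for illustration only; our engine builds a Jena model instead.
class OccurrenceMapper {
    /** XSD treats a missing minOccurs as 1. */
    static int minCount(String minOccurs) {
        return (minOccurs == null || minOccurs.isEmpty()) ? 1 : Integer.parseInt(minOccurs);
    }

    /** A missing maxOccurs means 1; "unbounded" means no sh:maxCount at all. */
    static OptionalInt maxCount(String maxOccurs) {
        if (maxOccurs == null || maxOccurs.isEmpty()) {
            return OptionalInt.of(1);
        }
        if (maxOccurs.equals("unbounded")) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(Integer.parseInt(maxOccurs));
    }

    /** Renders one property constraint as a Turtle blank node. */
    static String propertyShape(String predicate, String datatype,
                                String minOccurs, String maxOccurs) {
        StringBuilder turtle = new StringBuilder("[\n");
        turtle.append("  sh:predicate ").append(predicate).append(" ;\n");
        turtle.append("  sh:datatype ").append(datatype).append(" ;\n");
        int min = minCount(minOccurs);
        if (min > 0) { // a minimum of 0 needs no constraint
            turtle.append("  sh:minCount ").append(min).append(" ;\n");
        }
        OptionalInt max = maxCount(maxOccurs);
        if (max.isPresent()) {
            turtle.append("  sh:maxCount ").append(max.getAsInt()).append(" ;\n");
        }
        return turtle.append("]").toString();
    }
}
```

Calling propertyShape("ex:deletedDate", "xsd:dateTime", "", "") yields a blank node with both sh:minCount and sh:maxCount set to 1, because neither attribute is present.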

Shape constraints

Now, let us have a look at the output of the shape constraint generation. Note: the shapes are described in Norwegian, since the Noark 5 structure is all defined in Norwegian. I have translated the examples for this blog post.

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
ex:DeletionShape
  a sh:Shape ;
  sh:scopeClass ex:Deletion ;
  sh:property [
    sh:datatype xsd:dateTime ;
    sh:maxCount "1"^^xsd:int ;
    sh:minCount "1"^^xsd:int ; 
    sh:predicate ex:deletedDate ;
    sh:severity sh:Violation
  ] .

As mentioned earlier, in eInnsyn we only accept a subset of the core constraints from SHACL. This is an example of constraints that fit a triple with an expected literal in the object position. The example includes the constraints datatype, maxCount, minCount and predicate, and it also contains a severity. When we use this shape to validate our data, the data needs to fulfill the constraints. That means that every node in the scope of the shape (here, every instance of ex:Deletion) needs to have the predicate deletedDate exactly once (minCount and maxCount), and the literal of this predicate has to be of datatype xsd:dateTime.
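
The "exactly once" rule can be illustrated with a few lines of plain Java. This is only a sketch of what minCount and maxCount mean, using a hypothetical in-memory list of triples (subject, predicate, object as strings) rather than a real RDF model.

```java
import java.util.List;

// Hypothetical helper for illustration only.
class CountCheck {
    /** Counts the triples that have the given subject and predicate. */
    static long valueCount(List<String[]> triples, String subject, String predicate) {
        return triples.stream()
            .filter(t -> t[0].equals(subject) && t[1].equals(predicate))
            .count();
    }

    /** True when the number of values lies within [minCount, maxCount]. */
    static boolean satisfies(List<String[]> triples, String focusNode, String predicate,
                             int minCount, int maxCount) {
        long n = valueCount(triples, focusNode, predicate);
        return n >= minCount && n <= maxCount;
    }
}
```

With minCount and maxCount both set to 1, a focus node with zero or two ex:deletedDate values fails the check, and a node with exactly one value passes it.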

Why use SHACL?

Every constraint in SHACL has a corresponding SPARQL query definition that resolves the constraint. In previous projects I have used SPARQL to validate the input data. Writing (and executing) several SPARQL queries takes time, and complex SPARQL queries may be somewhat difficult to read and understand. Using SHACL we can achieve SPARQL-free validation, as we now have done in the eInnsyn project. And since SHACL can be written in any RDF syntax, it is easy to both write and read in your syntax of choice.

Let’s have a look at a SPARQL definition for a shape constraint. We will be using datatype as an example [1].

Definitions of sh:datatype

Parameters

  • Property: sh:datatype

  • Value type: rdfs:Resource

  • Summary: Datatype of all value nodes (e.g., xsd:integer)

Textual definition

A validation result must be produced for each value node that is not a literal, or is a literal with a mismatching data type. A literal matches a data type if the literal's datatype has the same IRI.

SPARQL definition

SELECT $this ($this AS ?subject) $predicate (?value AS ?object)
WHERE {
  $this $predicate ?value .
  FILTER (!isLiteral(?value) || datatype(?value) != $datatype) .
}

The SHACL working draft contains similar definitions for each core constraint of the vocabulary.
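
For readers who are more at home in Java than in SPARQL, the FILTER condition of the sh:datatype definition can be sketched as follows. The Value class is a hypothetical stand-in for an RDF node and is not taken from Jena or from our engine.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only.
class DatatypeCheck {
    /** A value node: a literal with a datatype IRI, or a non-literal (datatype == null). */
    static final class Value {
        final String lexical;
        final String datatype; // null when the node is an IRI or a blank node

        Value(String lexical, String datatype) {
            this.lexical = lexical;
            this.datatype = datatype;
        }

        boolean isLiteral() {
            return datatype != null;
        }
    }

    /** Returns the values that would produce a validation result for sh:datatype. */
    static List<Value> violations(List<Value> values, String expectedDatatype) {
        List<Value> failing = new ArrayList<>();
        for (Value v : values) {
            // Same condition as the FILTER:
            // !isLiteral(?value) || datatype(?value) != $datatype
            if (!v.isLiteral() || !v.datatype.equals(expectedDatatype)) {
                failing.add(v);
            }
        }
        return failing;
    }
}
```

The check compares datatype IRIs only, as in the textual definition. It does not look at whether the lexical form of the literal is well formed.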

Validation results

Then, what happens if the input data graph does not fit the shapes? As mentioned, a SHACL validation engine takes two immutable RDF graphs as input, a shapes graph and a data graph. The engine shall validate the data graph against the shapes graph. The following is the formal definition of the validation [2].

  • A node validates against a shape iff either it does not validate against some filter of the shape or none of the constraints in the shape produce a validation result with severity sh:Violation for the node.

  • A data graph validates against a shape iff each node that is in any of the scopes of the shape validates against the shape.

  • A data graph validates against a shape graph iff the data graph validates against each shape in the shapes graph.

The validation process returns a validation report containing all results. By default, this should contain results of all severity levels. However, the user may request results with a custom minimum severity.

Severity

SHACL describes three severity levels for validation results.

  • sh:Info: An informative message, not a violation.

  • sh:Warning: A non-critical constraint violation indicating a warning.

  • sh:Violation: A constraint violation that should be fixed.


In eInnsyn we have set every property constraint with the severity level sh:Violation for now.
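
The severity levels, the minimum-severity filter and the rule that only sh:Violation results make validation fail can be sketched in a few lines of Java. The types below are hypothetical and only model what is needed for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only.
class ValidationReport {
    /** Declared from least to most severe, so compareTo reflects severity. */
    enum Severity { INFO, WARNING, VIOLATION }

    static final class Result {
        final String focusNode;
        final Severity severity;

        Result(String focusNode, Severity severity) {
            this.focusNode = focusNode;
            this.severity = severity;
        }
    }

    /** Keeps only the results at or above the requested minimum severity. */
    static List<Result> atLeast(List<Result> results, Severity minimum) {
        List<Result> filtered = new ArrayList<>();
        for (Result r : results) {
            if (r.severity.compareTo(minimum) >= 0) {
                filtered.add(r);
            }
        }
        return filtered;
    }

    /** Data validates when no result has severity VIOLATION. */
    static boolean validates(List<Result> results) {
        return atLeast(results, Severity.VIOLATION).isEmpty();
    }
}
```

A report that only contains INFO and WARNING results therefore still validates, while a single VIOLATION result is enough to fail.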

SHACL Validation Result Vocabulary

The validation results produced by a SHACL engine must contain the errors from the data graph only. In addition to a severity, each validation result contains a set of values that are described in the SHACL Validation Results Vocabulary. The following graph is an example of a validation result.

ex:ExampleConstraintViolation
  a sh:ValidationResult ;
  sh:severity sh:Violation ;
  sh:focusNode ex:Journalpost123 ;
  sh:subject ex:Journalpost123 ;
  sh:predicate ex:deletedDate ;
  sh:object "ABC" ;
  sh:message "ex:deletedDate expects a literal of datatype xsd:dateTime." ;
  sh:sourceConstraintComponent sh:DatatypeConstraintComponent .

All validation results must be SHACL instances of the class sh:ValidationResult.

Validation predicates and their descriptions:

  • sh:focusNode: Points to an IRI or blank node that caused the result.

  • sh:subject, sh:predicate and sh:object: Validation results are often caused by a single RDF triple, or a predicate of a given subject or object. Information about this may be encoded via these properties.

  • sh:message: Communicates textual details to humans. There should not be two message values with the same language tag.

  • sh:severity: The severity level, as described in the previous section.

  • Source: Validation results may link to the sh:Constraint that caused the result via sh:sourceConstraint. The sh:Shape is linked with sh:sourceShape. The constraint component that caused the result is linked via sh:sourceConstraintComponent.

  • sh:detail: May link a parent result to other results that provide further details about the cause of the parent result.


In addition to this vocabulary, we introduced a couple more names for eInnsyn. While developing and testing our validation, we discovered the need to know the actual and expected values. We are calling our addition SHACL Extended. Note that SHACL Extended is NOT a part of the official SHACL vocabulary.

SHACL Extended

The main names of our SHACL Extended vocabulary are sh-ext:actual and sh-ext:expected. These predicates are meant to enrich the validation results produced by our SHACL engine when a data graph does not pass the validation. The following is an example of their use.

ex:ExampleConstraintViolation
  a sh:ValidationResult ;
  sh:severity sh:Violation ;
  sh:focusNode ex:Journalpost123 ;
  sh:subject ex:Journalpost123 ;
  sh:predicate ex:deletedDate ;
  sh:object "ABC" ;
  sh-ext:actual xsd:string ;
  sh-ext:expected xsd:dateTime ;
  sh:message "ex:deletedDate expects a literal of datatype xsd:dateTime." ;
  sh:sourceConstraintComponent sh:DatatypeConstraintComponent .

In this example we are expecting an xsd:dateTime, but the actual data type in the triple is xsd:string. Actual and expected values are machine-readable elements of the validation result graph, rather than plain text in the sh:message. This gives us the possibility to act on the validation error in our system, rather than just having a message that needs to be read by a human being.
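
As one possible illustration of acting on these values, the sketch below decides whether a datatype violation could be repaired by retyping the literal. The class, the two actions and the repair policy are all hypothetical and are not a description of what eInnsyn does. The date check uses java.time's ISO parsers, which only approximate the xsd:dateTime lexical form.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;

// Hypothetical helper for illustration only.
class ResultHandler {
    static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    enum Action { RETYPE, REJECT }

    /** Chooses an action from the sh-ext:actual and sh-ext:expected datatypes. */
    static Action decide(String lexicalValue, String actual, String expected) {
        boolean stringForDateTime =
            actual.equals(XSD + "string") && expected.equals(XSD + "dateTime");
        if (stringForDateTime && looksLikeDateTime(lexicalValue)) {
            return Action.RETYPE; // the value is usable, only the datatype is wrong
        }
        return Action.REJECT;
    }

    /** Accepts ISO date-times with or without a time zone offset. */
    static boolean looksLikeDateTime(String value) {
        try {
            OffsetDateTime.parse(value); // e.g. 2016-05-01T12:00:00Z
            return true;
        } catch (DateTimeParseException withoutZone) {
            try {
                LocalDateTime.parse(value); // e.g. 2016-05-01T12:00:00
                return true;
            } catch (DateTimeParseException notADateTime) {
                return false;
            }
        }
    }
}
```

For the result above, the value "ABC" is not a date-time, so this sketch would reject it. A string such as "2016-05-01T12:00:00Z" could instead be retyped.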

Summary

The SHACL engine of our eInnsyn project is implemented as an abstract syntax tree (AST) and is therefore easy to extend to support even more of the SHACL constraints given by the W3C working draft. If you want to know more about our solution, do not hesitate to contact us, and we will be happy to answer any questions you may have.

Sources

[1] Example taken from the SHACL working draft https://www.w3.org/TR/shacl/#DatatypeConstraintComponent.

[2] Validation definition cited from the SHACL working draft https://www.w3.org/TR/shacl/#validation.

 

About the blogger:
Veronika studied programming and networks at the University of Oslo and has a fondness for logic, semantic technologies, typography and electronics.
