Extracting shapes from large RDF data collections

Daniel Fernández Álvarez, Yasunori Yamamoto, Jose Emilio Labra Gayo, Andra Waagmeester

Abstract

There is an increasing number of projects based on RDF graphs. Shape languages, such as SHACL and ShEx, have been proposed to support the evolution of such projects in two main aspects: description and validation of RDF content. However, producing shapes for an existing knowledge graph is an arduous and time-consuming task when dealing with large data sources. Automatic shape extractors are software tools that address this issue: they produce RDF shapes by exploring existing RDF content. These tools, however, usually suffer from scalability issues related to memory availability in precisely those scenarios where they would be most useful: large data sources. To deal with these situations, some extractors implement sampling strategies, extracting shapes from a representative part of the input data rather than from the whole dataset. Such mechanisms may miss features that are infrequent in the input data. We propose an alternative approach based on splitting the original input into parts, running the extraction process over each part, and consolidating the obtained results into a single schema. We demonstrate through experimentation that our approach can outperform sampling with respect to the quality of the obtained results. The software used for these experiments is publicly available.
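The split, extract, and consolidate idea described in the abstract can be illustrated with a toy sketch. This is not the authors' software or its API: the function names (`split_by_subject`, `extract_profile`, `consolidate`), the tuple representation of triples, and the choice to split by subject are all assumptions made for illustration, and the "profile" here is only a per-class predicate count rather than a real ShEx or SHACL shape.

```python
from collections import Counter, defaultdict

RDF_TYPE = "rdf:type"  # assumed prefixed form of the rdf:type predicate


def split_by_subject(triples, n_parts):
    """Split triples into n_parts lists, keeping each subject's triples together.

    Assumption: grouping by subject lets every part be processed on its own.
    """
    parts = [[] for _ in range(n_parts)]
    part_of_subject = {}
    for s, p, o in triples:
        if s not in part_of_subject:
            part_of_subject[s] = len(part_of_subject) % n_parts
        parts[part_of_subject[s]].append((s, p, o))
    return parts


def extract_profile(triples):
    """Build a hypothetical per-class profile from one part.

    Returns {class: {"instances": int, "predicates": Counter}}, where the
    Counter holds how many instances of the class use each predicate.
    """
    classes_of = defaultdict(set)
    predicates_of = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            classes_of[s].add(o)
        else:
            predicates_of[s].add(p)

    profile = {}
    for s, classes in classes_of.items():
        for cls in classes:
            entry = profile.setdefault(
                cls, {"instances": 0, "predicates": Counter()}
            )
            entry["instances"] += 1
            entry["predicates"].update(predicates_of[s])
    return profile


def consolidate(profiles):
    """Merge the profiles of all parts by summing their counts."""
    merged = {}
    for profile in profiles:
        for cls, entry in profile.items():
            target = merged.setdefault(
                cls, {"instances": 0, "predicates": Counter()}
            )
            target["instances"] += entry["instances"]
            target["predicates"].update(entry["predicates"])
    return merged


triples = [
    ("ex:a", RDF_TYPE, "ex:Person"), ("ex:a", "ex:name", "A"),
    ("ex:b", RDF_TYPE, "ex:Person"), ("ex:b", "ex:name", "B"),
    ("ex:b", "ex:orcid", "0000"),
    ("ex:c", RDF_TYPE, "ex:Person"), ("ex:c", "ex:name", "C"),
]

parts = split_by_subject(triples, n_parts=2)
merged = consolidate(extract_profile(part) for part in parts)
```

In this toy input, `ex:orcid` is used by only one of the three `ex:Person` instances. Because every part is processed in full, that infrequent predicate still appears in the consolidated profile, whereas a sample that happened to skip `ex:b` would not see it.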

Type

Conference proceedings

Publication

In Semantic Web Applications and Tools for Health Care and Life Sciences, SWAT4HCLS.

Date

March, 2024