librdfa-rdf4j

Description

librdfa-rdf4j is an RDFParser built on top of librdfa, which claims to be the fastest RDFa processor. The librdfa processor is written in C and processes XML and HTML documents. To connect the Java code with the C processor, the RDF4J parser uses SWIG. The project folder contains an annotated C header file from which SWIG generates the wrapper code that makes the librdfa library available in Java. To see the C code developed for the bridge between librdfa and Java, check the c folder of the parser.

To use the parser, add it as a dependency, as in any normal Maven project. You can then use it through the Rio API by specifying RDFFormat.RDFA.

Performance

The test folder of the project contains a benchmark that compares librdfa-rdf4j with Semargl. Performance is compared as a function of the number of triples. Here are the conclusions we have observed:

  • For a small number of triples (<1000), librdfa-rdf4j is faster. Results with 900 triples: Semargl 2.70 ms; librdfa-rdf4j 1.94 ms.
  • For a mid-size number of triples (>1000, <5000), Semargl is faster, but the difference is small. Results with 3000 triples: Semargl 3.96 ms; librdfa-rdf4j 4.13 ms.
  • For a large number of triples (>5000), Semargl is faster. Results with 15000 triples: Semargl 31.31 ms; librdfa-rdf4j 45.97 ms.
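
Timings of this kind can be gathered with a small harness. The following is a minimal sketch, not the benchmark shipped in the test folder: ParseTimer and meanMillis are hypothetical names, and the workload in main is only a stand-in for a real parse call.

```java
// Hypothetical timing helper; not the benchmark from the project's test folder.
final class ParseTimer {

    // Runs the task a few times to warm up the JVM, then returns the
    // mean wall-clock time per run in milliseconds.
    static double meanMillis(Runnable task, int warmups, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / 1_000_000.0 / runs;
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with a call that parses a document.
        Runnable fakeParse = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
        };
        System.out.println("mean ms per run: " + meanMillis(fakeParse, 5, 20));
    }
}
```

Measuring both parsers with the same helper, on the same input and in the same JVM, keeps the comparison fair.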

In general, librdfa is faster than Semargl, but the integration with Rio adds some overhead. Rio loads the dataset into an InputStream before parsing it, whereas librdfa parses the triples as they arrive. As a result, librdfa-rdf4j first needs to load the dataset into an InputStream and then send the data to the C buffer through the Java-C bridge.
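
The two-step pattern described above (buffer first, then hand the data to the native side) can be sketched as follows. This is an illustration only, not code from librdfa-rdf4j: BufferThenForward, readFully, forward and the nativeSink consumer are hypothetical names, and the consumer merely stands in for the Java-C bridge.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration of "buffer, then forward"; not the parser's actual code.
final class BufferThenForward {

    // Step 1: drain the whole InputStream into memory.
    static byte[] readFully(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Step 2: pass the buffered bytes on in fixed-size chunks.
    // nativeSink stands in for the bridge to the C buffer.
    // Returns the number of chunks that were forwarded.
    static int forward(byte[] data, int chunkSize, Consumer<byte[]> nativeSink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int chunks = 0;
        for (int off = 0; off < data.length; off += chunkSize) {
            int end = Math.min(off + chunkSize, data.length);
            nativeSink.accept(Arrays.copyOfRange(data, off, end));
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] doc = "<p property=\"http://purl.org/dc/terms/title\">Hello</p>"
                .getBytes(StandardCharsets.UTF_8);
        List<byte[]> received = new ArrayList<>();
        int chunks = forward(readFully(new ByteArrayInputStream(doc)), 16, received::add);
        System.out.println(chunks + " chunks forwarded");
    }
}
```

The sketch makes the extra cost visible: the data is traversed once to fill the Java buffer and a second time to cross the bridge, which a parser reading directly from the stream would avoid.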

Requirements

librdfa-rdf4j uses the librdfa library, so librdfa needs to be installed beforehand. Please follow the installation steps in the librdfa repository.

In general, you need to clone the repository:

git clone https://github.com/rdfa/librdfa

Then install the library (make sure you have all the libraries that librdfa depends on):

cd librdfa
./autogen.sh
./configure
make
make install

Building from source

To compile librdfa-rdf4j, change into the source directory and run a Maven install:

mvn clean install

Install

You can install librdfa-rdf4j by adding the following Maven dependency (make sure the librdfa library is installed):

<dependency>
   <groupId>org.apache.any23</groupId>
   <artifactId>apache-any23-librdfa</artifactId>
   <version>${librdfa.rdf4j.version}</version>
</dependency>

Use

Once you have installed librdfa-rdf4j, you can use the parser with the Rio API. For example:

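// Create an RDFa parser through the Rio API.
RDFParser rdfParser = Rio.createParser(RDFFormat.RDFA);
// Collect the parsed statements into an in-memory model.
Model model = new LinkedHashModel();
rdfParser.setRDFHandler(new StatementCollector(model));
// 'in' is an InputStream or Reader over the RDFa document; the second argument is the base URI.
rdfParser.parse(in, "http://www.example.org./");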


librdfa extractor

By default, Any23 uses Semargl with the RDFa 1.1 standard. To use Semargl with the RDFa 1.0 standard instead, set the property any23.extraction.rdfa.programmatic to off. To use the librdfa extractor, enable the property any23.extraction.rdfa.librdfa. If the librdfa property is set, it overrides the Semargl property regardless of that property's value. By default, the librdfa property is off. After changing the extractor, you can use Any23 as usual.
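
The precedence between the two properties can be sketched as below. This is a hypothetical illustration of the rule, not Any23's actual selection code: RdfaExtractorChoice, Choice and choose are invented names, and the sketch assumes that the properties take the values on and off and that the Semargl property defaults to on.

```java
import java.util.Properties;

// Hypothetical sketch of the precedence rule; not Any23's actual code.
final class RdfaExtractorChoice {

    static final String PROGRAMMATIC = "any23.extraction.rdfa.programmatic";
    static final String LIBRDFA = "any23.extraction.rdfa.librdfa";

    enum Choice { SEMARGL_RDFA_1_1, SEMARGL_RDFA_1_0, LIBRDFA }

    static Choice choose(Properties props) {
        // The librdfa property is off by default; when it is on,
        // it wins regardless of the Semargl property.
        if ("on".equalsIgnoreCase(props.getProperty(LIBRDFA, "off"))) {
            return Choice.LIBRDFA;
        }
        // Assumed default: on selects Semargl with RDFa 1.1, off selects RDFa 1.0.
        if ("off".equalsIgnoreCase(props.getProperty(PROGRAMMATIC, "on"))) {
            return Choice.SEMARGL_RDFA_1_0;
        }
        return Choice.SEMARGL_RDFA_1_1;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(PROGRAMMATIC, "off");
        props.setProperty(LIBRDFA, "on");
        System.out.println(choose(props));
    }
}
```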

Remember to install the librdfa library in order to use the librdfa extractor.

To change the property, you can either set the ANY23_OPTS environment variable or set the property in the Configuration class. Check the official documentation for more details.
