Proposal: ANY23-295  Implement ability to use librdfa

Description

 

In 2012, the Any23 community decided to migrate from its own RDFa parser implementation to Semargl [1], as discussed in [2]. Semargl is a modular framework for crawling linked data from structured documents [9], which provides an RDFa parser compatible with RDF4J through an integration module [3]. Since that issue [4] was closed, Semargl has been the official RDFa parser for Any23.

Lewis John McGibbney reopened the discussion, proposing to test librdfa [5], a C/C++ library that claims to be ‘The fastest RDFa processor on the Internet’ and supports RDFa 1.0 and 1.1 in many varieties, such as XML+RDFa and XHTML+RDFa. The idea was to evaluate what kind of performance boost Any23 could achieve in parsing RDFa by using a native parser, and how well librdfa would integrate with Any23.

In this context, the present proposal aims to accomplish that objective and provide a seamless integration between Any23 and the librdfa parser, making it possible to conduct a fair performance comparison between Semargl and librdfa within Any23.

Student

Julio Caguano 

Mentor

Lewis John McGibbney

...

Full Proposal

Proposal Title: Integrate and evaluate the librdfa RDFa parser in Any23 via JNI (Java Native Interface) [10].

Student Name: Julio Caguano

Student Email: julio.caguanob@gmail.com

JIRA Issues: https://issues.apache.org/jira/browse/ANY23-295   Implement ability to use librdfa

Project Deliverables

  • New standalone module with a new RDFa parser compatible with RDF4J using librdfa.

  • JNI bridge to librdfa including interfaces and middleware utilities.

  • Unit tests for the new librdfa module.

  • Benchmark tests comparing Semargl and librdfa.

  • Self-maintaining Any23 website documentation that visualizes integration test results as well as Any23 compliance with the RDFa test suite at http://rdfa.info/test-suite/

Detailed description

Anything to Triples (Any23) is a library, a web service and a command-line tool that extracts structured data in RDF format from a variety of Web documents. Currently it supports the following input formats: RDF/XML, Turtle, Notation 3, RDFa ... [6]. As explained in the description above, the Any23 community would like to test new and possibly more efficient mechanisms for processing data. This proposal specifically covers the RDFa format and how it is parsed within Any23, putting forward the integration and evaluation of a new RDFa parser based on the C/C++ library librdfa. The integration will be addressed via JNI using a set of interfaces and middleware utilities, which will be documented and evaluated.

Scope for the project


This project covers the implementation of a new RDFa parser for Any23 that serves as a wrapper for the librdfa library. The project will also include an evaluation phase to measure the improvements or drawbacks of using such a parser as the main Any23 RDFa processor.

Design

The implementation will rely on the pre-existing parser infrastructure of Any23, which is provided by RDF4J, and will use JNI as the integration mechanism for librdfa. The development of the project will be divided into three phases.

 

Implementation Approach

...

Bridge: This phase will tackle the communication issues between Any23 and librdfa and will be mainly focused on:

  • Loading librdfa binaries into Java.

  • Sending data streams from Java to librdfa (Documents’ content).

  • Sending parsing configurations to librdfa (ParserConfig parameters - RDF4J).

  • Handling and throwing exceptions.

  • Retrieving statements (triples) from librdfa to Java.

The translation of Java objects to C structures and vice versa will probably be implemented with Protobuf [7] or a similar tool, but this may depend on the issues that actually arise during development.
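The Bridge tasks listed above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name LibrdfaBridge, the library name rdfa_jni, the TripleCallback interface and the three native method signatures are all hypothetical, and the matching C glue code is not shown. The sketch covers loading the binary, streaming document content in chunks, reporting a missing library as an exception, and receiving triples through a callback.

```java
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical JNI bridge to librdfa. Class, method and library names are
 * illustrative assumptions; the native side would be a small C glue layer.
 */
class LibrdfaBridge {

    /** Callback that the native glue code would invoke once per extracted triple. */
    interface TripleCallback {
        void onTriple(String subject, String predicate, String object);
    }

    // Try to load the (assumed) glue library once and remember whether it worked.
    private static final boolean AVAILABLE = tryLoad("rdfa_jni");

    private static boolean tryLoad(String libraryName) {
        try {
            System.loadLibrary(libraryName);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false; // callers can fall back to the pure-Java parser
        }
    }

    static boolean isAvailable() {
        return AVAILABLE;
    }

    // Native entry points (assumed signatures), implemented in the C glue layer.
    private static native long createContext(String baseUri);
    private static native void parseChunk(long context, byte[] data, int length, TripleCallback callback);
    private static native void freeContext(long context);

    /** Streams a document to the native parser in fixed-size chunks. */
    static void parse(String baseUri, InputStream in, TripleCallback callback) throws IOException {
        if (!AVAILABLE) {
            throw new IllegalStateException("native librdfa glue library is not loaded");
        }
        long context = createContext(baseUri);
        try {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                parseChunk(context, buffer, read, callback);
            }
        } finally {
            freeContext(context); // always release native memory
        }
    }
}
```

Checking isAvailable() before calling parse lets the caller keep using the existing parser on platforms where the native binary cannot be loaded.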

Wrapper: This phase will focus on fulfilling the RDF4J interfaces in order to guarantee compatibility with the existing parsers and other components of Any23. This phase will mainly deal with:

  • Implementing the necessary RDF4J interfaces and superclasses (e.g. RDFParser, RDFParserFactory).

  • Configuring the project to work correctly with SPI.
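The wrapper idea can be illustrated with a small, self-contained sketch. To keep it compilable without RDF4J on the classpath, it uses stand-in types (StatementSink, TripleEngine, TripleListener, WrapperParser, WrapperParserFactory) that are hypothetical and only mimic the roles of a statement handler, a parser and a parser factory; the real module would implement the RDF4J types instead. As far as SPI is concerned, RDF4J discovers parser factories through Java's service-provider mechanism, so the real factory class would also need to be listed in a provider file under META-INF/services.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Stand-in for an RDF4J-style handler; NOT the real org.eclipse.rdf4j API. */
interface StatementSink {
    void startRDF();
    void handleStatement(String subject, String predicate, String object);
    void endRDF();
}

/** Receives one call per triple produced by the engine. */
interface TripleListener {
    void triple(String subject, String predicate, String object);
}

/** Hypothetical view of the native engine, so the wrapper can be tested without librdfa. */
interface TripleEngine {
    void run(byte[] document, String baseUri, TripleListener listener);
}

/** Wrapper that exposes the engine through a parser-like contract. */
class WrapperParser {
    private final TripleEngine engine;
    private StatementSink sink;

    WrapperParser(TripleEngine engine) {
        this.engine = engine;
    }

    void setSink(StatementSink sink) {
        this.sink = sink;
    }

    void parse(InputStream in, String baseUri) throws IOException {
        if (sink == null) {
            throw new IllegalStateException("no sink configured");
        }
        // Buffer the whole document; a real wrapper could stream it instead.
        ByteArrayOutputStream document = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            document.write(buffer, 0, read);
        }
        sink.startRDF();
        engine.run(document.toByteArray(), baseUri, sink::handleStatement);
        sink.endRDF();
    }
}

/** Factory with a no-arg constructor, as service registries usually require. */
class WrapperParserFactory {
    String formatName() {
        return "RDFa";
    }

    WrapperParser newParser() {
        // Placeholder engine that emits nothing; the real one would call into JNI.
        return new WrapperParser((document, baseUri, listener) -> { });
    }
}
```

Hiding the native engine behind an interface keeps the wrapper unit-testable with a fake engine, independently of whether the librdfa binary is available on the build machine.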

Evaluation: This phase will compare the performance of the new parser with respect to the existing one. The main activities to be executed are:

  • Define a dataset of RDFa documents.

  • Measure triple extraction time for Any23 with the existing parser.

  • Measure triple extraction time for Any23 with librdfa.

  • Compare, analyze and share the results.
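A minimal sketch of how the extraction time might be measured is given below. It is an assumption about the methodology, not the benchmark the project will necessarily use: the ExtractionBenchmark class and its methods are hypothetical, the extractor is modelled as a function from document bytes to a triple count, and a dedicated harness such as JMH would give more reliable numbers. The sketch runs a few warm-up iterations so that JIT compilation does not dominate the timings, then reports the median of the measured runs.

```java
import java.util.Arrays;
import java.util.function.ToIntFunction;

/** Hypothetical timing helper for comparing two extractors on the same document. */
class ExtractionBenchmark {

    // Accumulates results so the JIT cannot discard the extractor calls as dead code.
    static volatile long checksum;

    /** Returns the wall-clock time in nanoseconds of each measured run. */
    static long[] measure(ToIntFunction<byte[]> extractor, byte[] document, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            checksum += extractor.applyAsInt(document); // warm-up runs are not timed
        }
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            checksum += extractor.applyAsInt(document);
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    /** Median of the samples; less sensitive to outliers (GC pauses) than the mean. */
    static double median(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        byte[] document = "<p property=\"name\">Alice</p>".getBytes();
        // Toy extractor standing in for a real parser: counts '<' characters.
        ToIntFunction<byte[]> toy = bytes -> {
            int count = 0;
            for (byte b : bytes) {
                if (b == '<') count++;
            }
            return count;
        };
        long[] samples = measure(toy, document, 100, 20);
        System.out.println("median ns: " + median(samples));
    }
}
```

Running the same measure call once with the existing parser and once with the librdfa-based parser over every document in the dataset would give two comparable sets of medians.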

Finally, it is worth mentioning that every component coded during each phase will be accompanied by corresponding documentation in the Wiki [8].


Time Frame

Time Period           | Expected Outcome
March 01 - April 23   | Understanding the task and preparing the proposal
April 24 - April 30   | Community bonding
May 01 - June 10      | Phase 1: Bridge
June 11 - June 15     | GSoC Evaluation 1
June 16 - June 29     | Phase 2: Wrapper
June 30 - July 8      | Phase 3: Evaluation
July 9 - July 13      | GSoC Evaluation 2
July 14 - July 25     | Camera-ready documentation and sharing results with the community
July 26 - August 5    | Receive feedback and fix minor issues
August 6 - August 14  | GSoC Final evaluation


About Myself

I am Julio Caguano, an undergraduate Computer Science student at the University of Cuenca in Ecuador. I am currently in my final year of college.

I got started with web technologies a couple of years ago during my college courses, where I discovered the Linked Data and Semantic Web initiatives. Since then I have been working with RDF and SPARQL in some of my college assignments. I would like to deepen my knowledge of these technologies because their importance has been growing steadily in recent years. I also look forward to continuing to study these research areas in a postgraduate course.

I consider that I have a good knowledge of the Java language and related technologies (e.g. Maven, SPI). I have also taken some C/C++ classes at college and have experimented with the language on my own. I therefore feel confident enough to work on this project and meet the proposed goals.

My main motivation for applying to this project is my background in Linked Data technologies: I have used Any23 in the past and liked how it works, and I found the code comprehensible and readable. On top of that, I have always liked integration challenges. I find them interesting because you have to push yourself out of your comfort zone, learn new technologies, and work out how to interact with them.


Commitment

I estimate that I can dedicate 25 hours per week to this project during the coding period (including weekends and free time midweek). This could be increased depending on the progress of the project or on my mentor's suggestions. I will split my time between my studies and this project, which hopefully will not be a problem because the project takes place at the beginning of my school semester, when the assignment load is small. In addition, I will post a weekly report in the GSoC section of the project's wiki to share my progress on the planned tasks.

References

[1] https://github.com/semarglproject/semargl 

[2] http://markmail.org/thread/wn3fxkwozc3zkfqc 

[3] https://github.com/semarglproject/semargl-rdf4j 

[4] https://issues.apache.org/jira/browse/ANY23-137 

[5] https://github.com/rdfa/librdfa 

[6] http://any23.apache.org/ 

[7] https://github.com/google/protobuf 

[8] https://cwiki.apache.org/confluence/display/ANY23/CSoC+2018 

[9] https://github.com/semarglproject/semargl 

[10] https://es.wikipedia.org/wiki/Java_Native_Interface 

Project Reports

1/6/2018

Project description

...