Proposal: ANY23-295
Implement ability to use librdfa
Description
In 2012, the Any23 community decided to migrate from its own RDFa parser implementation to Semargl [1], as discussed in [2]. Semargl is a modular framework for crawling linked data from structured documents [9]; it provides an RDFa parser that is compatible with RDF4J through an integration module [3]. Since that issue [4] was closed, Semargl has been the official RDFa parser for Any23.
Lewis John McGibbney reopened the discussion by proposing to test librdfa [5], a C/C++ library that claims to be ‘The fastest RDFa processor on the Internet’ and supports RDFa 1.0 and 1.1 in many varieties, such as XML+RDFa and XHTML+RDFa. The idea is to evaluate what performance boost Any23 could achieve in RDFa parsing by using a native parser, and how well librdfa would integrate with Any23.
In this context, the present proposal aims to provide a seamless integration between Any23 and the librdfa parser, which will allow a fair performance comparison between Semargl and librdfa within Any23.
Student
Mentor
JIRA Issue
https://issues.apache.org/jira/browse/ANY23-295
Proposal Title: Integrate and evaluate the librdfa RDFa parser in Any23 via JNI (Java Native Interface) [10].
Student Name: Julio Caguano
Student Email: julio.caguanob@gmail.com
JIRA Issues: https://issues.apache.org/jira/browse/ANY23-295 Implement ability to use librdfa
Project Deliverables
New standalone module with a new RDFa parser compatible with RDF4J using librdfa.
JNI bridge to librdfa including interfaces and middleware utilities.
Unit tests for the new librdfa module.
Benchmark tests comparing Semargl and librdfa.
Self-maintaining Any23 website documentation that visualizes the integration test results as well as Any23's compliance with the RDFa test suite at http://rdfa.info/test-suite/
Detailed description
Scope for the project
This project covers the implementation of a new RDFa parser for Any23 that serves as a wrapper for the librdfa library. It also includes an evaluation phase to measure the improvements or drawbacks of using this parser as the main Any23 RDFa processor.
Design
The implementation will rely on the existing parser infrastructure of Any23, which is provided by RDF4J, and will use JNI as the integration mechanism for librdfa. Development will be divided into three phases.
Implementation Approach
Bridge: This phase will tackle the communication issues between Any23 and librdfa and will be mainly focused on:
Loading librdfa binaries into Java.
Sending data streams from Java to librdfa (Documents’ content).
Sending parsing configurations to librdfa (ParserConfig parameters - RDF4J).
Handling and throwing exceptions.
Retrieving statements (triples) from librdfa to Java.
The translation between Java objects and C structures will probably be implemented with Protobuf [7] or a similar tool, but this may change depending on the issues that arise during development.
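As a minimal sketch of two of the bridge tasks above (loading the binary and receiving triples), the Java side could look roughly like this. The names LibrdfaBridge, TripleCallback, tryLoad and parse are hypothetical and are not part of Any23 or librdfa; the native half of parse would still have to be written in C against librdfa.

```java
/** Receives one triple at a time from the native side (hypothetical callback shape). */
interface TripleCallback {
    void onTriple(String subject, String predicate, String object);
}

/** Hypothetical JNI bridge; class, method and library names are illustrative only. */
final class LibrdfaBridge {
    private LibrdfaBridge() {}

    /**
     * Tries to load a native library by name from java.library.path.
     * Returns false instead of throwing, so callers can fall back to another parser.
     */
    static boolean tryLoad(String libraryName) {
        try {
            System.loadLibrary(libraryName);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false;
        }
    }

    /**
     * Assumed native entry point: parses the document bytes against a base IRI and
     * reports each triple through the callback. Calling it before the library is
     * loaded fails with UnsatisfiedLinkError.
     */
    static native void parse(byte[] document, String baseIri, TripleCallback callback);
}
```

Assuming the binary is installed under the name "rdfa", a call such as LibrdfaBridge.tryLoad("rdfa") would make the JVM look for librdfa.so on Linux. Returning a boolean rather than throwing is one possible design that lets Any23 keep using Semargl when the native library is missing.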
Wrapper: This phase will focus on implementing the RDF4J interfaces in order to guarantee compatibility with the existing parsers and other components of Any23. It will mainly deal with:
Implement the necessary RDF4J interfaces (e.g. RDFParser, RDFParserFactory).
Configure the project so that the new parser is discovered correctly through SPI.
Evaluation: This phase will compare the performance of the new parser with respect to the existing one. The main activities to be executed are:
Define a dataset of RDFa documents.
Measure triple extraction time for Any23 with the existing parser.
Measure triple extraction time for Any23 with librdfa.
Compare, analyze and share the results.
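The measurement steps above could be sketched with a small timing harness like the one below. ParserBenchmark and its methods are hypothetical names, and each parser is represented only as a function that consumes a document string; a real comparison would more likely rely on a dedicated tool such as JMH.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Minimal timing harness (illustrative only, not part of Any23). */
final class ParserBenchmark {
    private ParserBenchmark() {}

    /**
     * Returns the median wall-clock time, in nanoseconds, of one full pass
     * over the dataset. Warm-up passes run first and are not recorded.
     */
    static long medianNanos(Consumer<String> parser, List<String> documents,
                            int warmupRuns, int measuredRuns) {
        if (measuredRuns < 1) {
            throw new IllegalArgumentException("measuredRuns must be >= 1");
        }
        for (int i = 0; i < warmupRuns; i++) {
            for (String doc : documents) {
                parser.accept(doc);
            }
        }
        long[] samples = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            for (String doc : documents) {
                parser.accept(doc);
            }
            samples[i] = System.nanoTime() - start;
        }
        return median(samples);
    }

    /** Median of the samples; for an even count, the mean of the two middle values. */
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2;
    }
}
```

Running medianNanos once with a Semargl-backed function and once with a librdfa-backed one, over the same document list and run counts, would give two comparable numbers. The median is used here because it is less sensitive than the mean to occasional pauses such as garbage collection.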
Finally, it is worth mentioning that every component coded during each phase will be accompanied by corresponding documentation in the Wiki [8].
Time Frame
Time Period | Expected Outcome |
---|---|
March 01 - April 23 | Understanding the task and preparing proposal |
April 24 - April 30 | Community bonding |
May 01 - June 10 | Phase 1: Bridge. |
June 11 - June 15 | GSoC Evaluation 1 |
June 16 - June 29 | Phase 2: Wrapper. |
June 30 - July 8 | Phase 3: Evaluation |
July 9 - July 13 | GSoC Evaluation 2. |
July 14 - July 25 | Camera-ready documentation and sharing results with the community. |
July 26 - August 5 | Receive feedback and fix minor issues. |
August 6 - August 14 | GSoC Final evaluation. |
About Myself
I am Julio Caguano, an undergraduate student of Computer Science at the University of Cuenca in Ecuador, currently in my final year of college.
I got started with web technologies a couple of years ago during my college courses, where I discovered the Linked Data and Semantic Web initiatives. Since then I have been working with RDF and SPARQL in some of my college assignments. I would like to deepen my knowledge of these technologies because their importance has been growing steadily in recent years. I also look forward to continuing to study these research areas in a postgraduate course.
I consider that I have a good knowledge of the Java language and related technologies (e.g. Maven, SPI). I have also taken some C/C++ classes at college and have experimented with these languages on my own, so I feel confident about working on this project and meeting the proposed goals.
My main motivation for applying to this project is my background in Linked Data technologies: I have used Any23 in the past and liked how it works, and I found the code comprehensible and readable. On top of that, I have always liked integration challenges. I find them interesting because you have to push yourself out of your comfort zone, learn new technologies, and work out how to interact with them.
Commitment
I estimate that I can dedicate 25 hours per week to this project during the coding period (including weekends and midweek free time). This could increase depending on the progress of the project or the suggestions of my mentor. I will split my time between my studies and this project, which hopefully will not be a problem, since the project takes place at the beginning of my school semester, when the assignment load is small. In addition, I will post a weekly report in the GSoC section of the project's wiki to share my progress on the planned tasks.
References
[1] https://github.com/semarglproject/semargl
[2] http://markmail.org/thread/wn3fxkwozc3zkfqc
[3] https://github.com/semarglproject/semargl-rdf4j
[4] https://issues.apache.org/jira/browse/ANY23-137
[5] https://github.com/rdfa/librdfa
[7] https://github.com/google/protobuf
[8] https://cwiki.apache.org/confluence/display/ANY23/CSoC+2018
[9] https://github.com/semarglproject/semargl
[10] https://es.wikipedia.org/wiki/Java_Native_Interface
Project Reports
1/6/2018
Project description
Review of Previous Actions
Objectives
Future Actions
Mentors Comments