Proposal: ANY23-249 Update all W3C and other Standards Compliance within Any23 

Description

Michele Mostarda and Lewis John McGibbney have been discussing what would work well for an Any23 Google Summer of Code Project for 2015.

To rebuild confidence in Any23's standards compliance, in light of new W3C standards that may have emerged or advanced and of emerging non-W3C standards such as microformats2, it would be a very worthwhile effort to have one or more students engage in:

  • initially evaluating all of the existing standards compliance within Any23
  • uncovering which aspects of the compiled list require attention, e.g. updating, overhauling, re-implementation, extension or otherwise
  • progressing with the execution of the above under the supervision of one or more of the assigned mentors

Student

Nisala Mendis

Mentor(s)

Lewis John McGibbney, Michele Mostarda

JIRA Issue

https://issues.apache.org/jira/browse/ANY23-249

Full Proposal

Proposal Title : Microformats2 Support for Any23

Student Name: Nisala Mendis

Student Email : nisala12@gmail.com

JIRA Issues: https://issues.apache.org/jira/browse/ANY23-249 : Update all W3C and other Standards Compliance within Any23, https://issues.apache.org/jira/browse/ANY23-207 : Engage with Microformats2

Project Deliverables

Microformats2 extractors for Any23 core

Extensions to the Any23 CLI tool supporting Microformats2

Extensions to the Any23 REST API supporting Microformats2

JUnit Test cases for Microformats2 extractors

Detailed description

Anything To Triples (Any23) is a library, a web service and a command-line tool that extracts structured data from web documents, supporting many different input formats (RDF, microdata, microformats, etc.). The extracted data can be converted to different formats (XML, JSON, etc.). This project concerns standards compliance: checking each supported format against the latest W3C or non-W3C specification it is based on, and updating the implementations to the latest stable versions. Several of the original implementations are based on specifications that have since been superseded by more recent ones.

E.g.

The microdata extractor implementation is based on the draft specification in [1], but the latest published version is available in [2].

The microformats extractor implementations are based on the specification in [3], but the latest published version is available in [4].

Scope for the project

This project will directly involve implementing support for the microformats2 specification [4] in Any23. Original microformats support should be retained separately. As GSoC is limited to 12 weeks, I have prepared a milestone plan only for implementing the microformats2 extractors in the Any23 library. Depending on time availability, it would also be appropriate to update the microdata extractors to the 2013 specification in [2].

Design

The implementation can be carried out in two different ways:

  1. reusing an existing microformats2 library
  2. extending the current microformats parser/extractors in Any23 to microformats2 (note that the original microformats support is retained; the extensions are made separately)

However, according to [5] there is no native Java library available for parsing microformats2, so the feasibility of adapting an existing library, evaluating the candidates and selecting an appropriate one remains a concern. A critical comparison of the two microformats specifications suggests instead that microformats2 support can be implemented relatively easily as extensions to the parser implementation already available for microformats. Any23 already supports microformats and has a parser (TagSoupParser, an HTML DOM parser) and a set of extractors, one per microformat, such as AdrExtractor and GeoExtractor. Since the existing parser can be reused for microformats2, the implementation is fairly straightforward: it involves implementing an extractor class for each microformat defined in the microformats2 specification (the microformats2 parsing specification and the v2 vocabularies are in [4]):

  1. h-adr
  2. h-card
  3. h-entry
  4. h-event
  5. h-feed
  6. h-geo
  7. h-item
  8. h-listing
  9. h-product
  10. h-recipe
  11. h-resume
  12. h-review
  13. h-review-aggregate
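As a minimal sketch of what distinguishes these vocabularies at the markup level: in the microformats2 parsing specification, root class names carry the h- prefix, while property class names carry the p- (plain text), u- (URL), dt- (date/time) or e- (embedded markup) prefix. The helper below is hypothetical and is not part of Any23; it only classifies the tokens of a class attribute by prefix and skips the stricter name validation that the specification requires.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an Any23 class): classifies microformats2 class tokens by prefix. */
final class Mf2ClassTokens {

    enum Kind { ROOT, PLAIN_TEXT, URL, DATETIME, EMBEDDED_MARKUP, OTHER }

    /** Maps a single class token such as "h-adr" or "p-name" to its kind. */
    static Kind classify(String token) {
        if (hasPrefix(token, "h-"))  return Kind.ROOT;
        if (hasPrefix(token, "p-"))  return Kind.PLAIN_TEXT;
        if (hasPrefix(token, "u-"))  return Kind.URL;
        if (hasPrefix(token, "dt-")) return Kind.DATETIME;
        if (hasPrefix(token, "e-"))  return Kind.EMBEDDED_MARKUP;
        return Kind.OTHER; // e.g. styling classes or microformats 1 names such as "vcard"
    }

    // Simplification: only checks that something follows the prefix.
    private static boolean hasPrefix(String token, String prefix) {
        return token.startsWith(prefix) && token.length() > prefix.length();
    }

    /** Returns the root (h-*) class names found in a whitespace-separated class attribute. */
    static List<String> rootTypes(String classAttribute) {
        List<String> roots = new ArrayList<>();
        for (String token : classAttribute.trim().split("\\s+")) {
            if (classify(token) == Kind.ROOT) {
                roots.add(token);
            }
        }
        return roots;
    }

    public static void main(String[] args) {
        System.out.println(rootTypes("vcard h-card p-author")); // prints [h-card]
        System.out.println(classify("dt-start"));               // prints DATETIME
    }
}
```

An extractor for one of the vocabularies listed above could use a check of this kind to decide which DOM nodes it is responsible for, before reading the individual properties.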

Note that some of the microformats are direct updates to microformats defined in the previous specification.

E.g. h-adr is the microformats2 update to adr.

Even though there are similarities, the new extractor classes will directly follow the microformats2 parsing specification in [4]. Because we are retaining both versions, I suggest creating a new package, microformats2, inside org.apache.any23.extractor.html or org.apache.any23.extractor (Any23 core Maven module). That way it is cleaner to maintain microformats and microformats2 support separately.

The CLI interface for microformats2 will remain similar to the one for microformats. The two versions need to be distinguishable, since we are retaining the original microformats support alongside microformats2; both will exist in the system.

So, for example, when separating the functionality of an extractor between the two versions:

microformat extractor : html-mf-adr

microformat2 extractor : html-mf2-hadr

Note: mf2 is added to distinguish microformats2 extractors from microformats 1 extractors.
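The naming rule can be sketched as follows. This is a hypothetical helper, not an existing Any23 class, and the rule itself is an assumption inferred only from the html-mf2-hadr example above: the html-mf2- prefix followed by the root class name with its hyphens removed. How multi-word vocabularies such as h-review-aggregate should be named is not decided in this proposal.

```java
/** Hypothetical helper (not an Any23 class): derives a microformats2 extractor name. */
final class ExtractorNames {

    /**
     * Builds the extractor name for a microformats2 root class.
     * Assumed rule, inferred from the single example "h-adr" -> "html-mf2-hadr".
     */
    static String mf2ExtractorName(String rootClass) {
        if (!rootClass.startsWith("h-") || rootClass.length() <= 2) {
            throw new IllegalArgumentException("not a microformats2 root class: " + rootClass);
        }
        return "html-mf2-" + rootClass.replace("-", "");
    }

    public static void main(String[] args) {
        System.out.println(mf2ExtractorName("h-adr")); // prints html-mf2-hadr
    }
}
```

Keeping the mf2 marker in the name means that a user who selects extractors by name on the command line can enable either version without ambiguity.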

The REST service extensions for microformats2 will be implemented in much the same way as the existing microformats support. Since we are retaining the original support, the same measures will be taken to separate the functionality of the two versions.

Implementation Approach

Phase 1

Extractor class implementations for each microformat defined in microformats2.

Eg:

h-adr : HadrExtractor

h-card : HcardExtractor

h-entry : HentryExtractor

h-event : HeventExtractor

Phase 2

CLI and REST service extensions to microformats2.

Phase 3

Documentation and JUnit test cases. Test cases similar to the existing ones will be added.

Eg:

HadrExtractorTest

HReviewExtractorTest

HResumeExtractorTest

Phase 4 ( Extended Phase )

Update MicrodataParser and MicrodataExtractor in Any23 core according to the updated standard specification [2].

Time Frame

Time Period | Expected Outcome
Community Bonding Period | Getting familiar with the Any23 Core, CLI and Service package codebases and the microformats2 parsing specification
May 25 to June 10 | Phase 1: Extractor classes for the h-adr, h-card, h-entry and h-event microformats
June 10 to June 24 | Phase 1: Extractor classes for the h-feed, h-geo, h-item and h-listing microformats
June 24 to July 8 | Phase 1: Extractor classes for the h-product, h-recipe, h-resume, h-review and h-review-aggregate microformats
July 8 to August 1 | Phase 2
August 2 to August 21 | Phase 3

Extended Scope

Time Period | Expected Outcome
Depends on time availability | Phase 4

About Myself

I am Nisala Nirmana. I am currently a postgraduate student at the University of Colombo and work part-time as a research engineer at the Mobile Communication Research Laboratory at the University of Moratuwa, Sri Lanka. I am a graduate of Curtin University, Perth, Australia, where I specialized in Computer Systems and Networking, and of the University of Kelaniya, Sri Lanka, where I specialized in Computer Science and Electronics.

I consider myself an open source enthusiast, and I use many open source products both personally and for academic purposes. The research areas I am interested in are cloud computing, SOA and web services.

I have successfully completed projects related to web services and server-side network programming in a Linux environment. The technologies used were the Apache Axis and Axis2 SOAP web services frameworks and the Java Spring REST web services framework. I am proficient mainly in Java, in build tools such as Apache Maven, and in version control with Git. I am confident that I have the capabilities needed to complete this project successfully.

My motivation for applying to Apache Any23 rather than to other projects is that it is a prominent open source project from Apache, and that this project directly involves implementing a well-known open standard specification, microformats2, in the Any23 core libraries.

Commitment

I will be able to allocate more than 45 hours of work to this project during the coding period. It is worth mentioning that I won't be engaged in any work activities during this period apart from my studies, so I can fully concentrate on the project. I will also be able to start the implementation during the community bonding period, so that a portion of the work is covered early and the project can be completed successfully within the coding period.

References

  1. http://www.w3.org/TR/2011/WD-microdata-20110525/
  2. http://www.w3.org/TR/2013/NOTE-microdata-20131029/
  3. http://microformats.org/wiki/Main_Page
  4. http://microformats.org/wiki/microformats-2
  5. http://microformats.org/wiki/parsers#microformats2_parsers
  6. https://github.com/nisalanirmana/

Project Reports

1/6/2015

Project description 

Worked on adding missing meta elements to the HTML meta extractor. The attached patch was submitted for review.

Review of Previous Actions

N/A

Objectives

Currently HTMLMetaExtractor extracts only the name attribute of meta elements. It was extended to also include http-equiv and charset.

Future Actions

Concerns were raised on including itemProp attribute.

Mentors Comments

The work so far has essentially addressed one JIRA issue. Although we are off to a slow start, the most important thing is that we are moving.

As mid-term reporting will be upon us sooner rather than later, we need to start documenting the core issue here, which is the implementation of Microformats2. I will be looking for Nisala Mendis to provide detailed updates on how Microformats2 support should be implemented and what the key differences are, and then to progress with the implementation.

Signed: Lewis McGibbney

 

 

8/6/2015

Project description 

Carried out a detailed analysis of microformats 1 and microformats 2 and created a report for the implementation. This analysis is key, as we will be reusing the existing support and implementing the changes as extensions. Key differences were noted and the report was submitted for the mentors' review.

Review of Previous Actions

N/A

Objectives

Analyzed the current extractor implementations. Identified the areas that need to be extended in order to include extractors for microformats 2.

Future Actions
Mentors Comments

 

16/6/2015

Project description 

Started the implementation with the HAdr and HGeo microformat extractors. One of the key changes, other than the class names and property names, is that an HAdr can have an HGeo as a nested property. (These two microformats were not specified as nestable in the microformats 1 spec. There is also an implementation difference between nesting as a property and nesting one microformat inside another.)
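The difference between the two kinds of nesting can be sketched as follows. In the microformats2 parsing specification, an element inside a microformat that carries both a property class and a root class (for example class="p-geo h-geo" inside an h-adr) is a nested microformat used as a property value, whereas an element that carries only a root class is a child microformat with no property name attached. The class below is a hypothetical illustration and is not the code committed for this project.

```java
/** Hypothetical helper (not an Any23 class): tells the two kinds of microformats2 nesting apart. */
final class Mf2Nesting {

    enum Role { PROPERTY_WITH_NESTED_MICROFORMAT, CHILD_MICROFORMAT, PLAIN_PROPERTY, NONE }

    /** Decides the role of an element found inside a microformat from its class attribute. */
    static Role roleOf(String classAttribute) {
        boolean hasRoot = false;
        boolean hasProperty = false;
        for (String token : classAttribute.trim().split("\\s+")) {
            if (token.startsWith("h-")) {
                hasRoot = true;
            } else if (token.startsWith("p-") || token.startsWith("u-")
                    || token.startsWith("dt-") || token.startsWith("e-")) {
                hasProperty = true;
            }
        }
        if (hasRoot && hasProperty) return Role.PROPERTY_WITH_NESTED_MICROFORMAT;
        if (hasRoot) return Role.CHILD_MICROFORMAT;
        if (hasProperty) return Role.PLAIN_PROPERTY;
        return Role.NONE;
    }

    public static void main(String[] args) {
        System.out.println(roleOf("p-geo h-geo")); // prints PROPERTY_WITH_NESTED_MICROFORMAT
        System.out.println(roleOf("h-geo"));       // prints CHILD_MICROFORMAT
        System.out.println(roleOf("p-locality"));  // prints PLAIN_PROPERTY
    }
}
```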

Review of Previous Actions

N/A

Objectives

I created a separate package for the microformats 2 extractors. This was one of the noted requirements, since we want to retain the original microformats 1 support. Added JUnit test cases for special cases.

Future Actions
Mentors Comments

 

24/6/2015

Project description 

I made changes according to the mentors' feedback and added further JUnit test cases. I also started implementing the rest of the microformats from the spec.

Review of Previous Actions

N/A

Objectives

Complete the implementation of the rest of the microformats.

Future Actions
Mentors Comments

 

MidTerm

7/7/2015

Project description 

Implementing the rest of the microformats according to the mentors' feedback.

Review of Previous Actions
Objectives

Implemented H-item and H-recipe with some sample test cases. See commit https://github.com/nisalanirmana/any23/commit/1616c17cb6497bcdf7947ee1048027f1b6d83a9f

Future Actions

Complete the embedded properties of other nested microformats.

 

14/7/2015

Project description 

Implementing the rest of the microformats according to the mentors' feedback.

Review of Previous Actions
Objectives

Implemented H-product and H-event with some sample test cases. See commit https://github.com/nisalanirmana/any23/commit/cc0dfbe8127a00fa712c7d2df6785a73c290feae

Future Actions

Complete the embedded properties of other nested microformats.

 

28/7/2015

Project description 

Implementing the rest of the microformats according to the mentors' feedback. Included the implementation of embedded properties.

Review of Previous Actions
Objectives

Implemented H-entry and H-resume with some sample test cases. See commit https://github.com/nisalanirmana/any23/commit/417b71a757ecb444a98cebeb25f48faa1c27524f

Future Actions

Complete the HCard extractor and fix the dependencies it has with the other extractors.

 

5/8/2015

Project description 

Implementing the HCard extractor and completing all the TODO cases where HCard is used as an embedded property in the already implemented microformats. Test cases should be written for cases such as the HEntry microformat, which can have an embedded author property that is itself an HCard.

Review of Previous Actions
Objectives

Implemented HCard and completed all the cases where other microformats depend on HCard as an embedded property. Added test cases for HCard extraction and for HEntry (HCard embedded property), committed in https://github.com/nisalanirmana/any23/commit/cf48a5bf88b40bc327108a4daa857e14d914d654

Future Actions

Complete the tasks from the code reviews given by the mentor, Lewis McGibbney. Then commit the changes related to the HReview extractor and its test cases, and finalize the work and documentation.


4 Comments

  1. Nisala Mendis thank you for updating the license headers.

    Do you have a timeline on when the other issues can be updated? e.g. Further commit the changes related to HReview extractor and test cases and finalize the work and documentation.

    It would also be very nice to quickly update the documentation on the Any23 website with the addition of Microformats2.

    Finally, it would be most appreciated if you could head over to the Microformats community and advertise your work to them. I am sure they will be most interested in your work (smile)

    1. Hi Lewis,

      Lewis John McGibbney I am currently working on pushing the changes related to HReview. It will take me a couple of hours, as I am still testing them. As for the documentation (Any23 website), I will work on it after pushing those changes. Can you please tell me how I can proceed with adding the documentation to the Any23 website?

      Thanks for the comments Lewis, I highly appreciate them (smile) I will work on the comments you posted.

      Regards
      Nisala

      1. Can you please tell me how I can proceed with adding the documentation to Any23 web site?

        A tutorial for updating the website content can be found at

        Building Apache Any23 Website HOW_TO

        If you send in a patch for your website updates then I will make sure that they make it on to the website (smile)

         

      2. Hi Nisala Mendis did you manage to complete your updates to the pull request? Can you please submit them and we can get this code committed and the issue closed off? Thanks