Title/Summary: Add search capability to index/search artifacts in the SCA domain, including the contributions, WSDL/XSDs, Java files, composite files

Student: Phillipe Ramalho

Student e-mail: phillipe.ramalho@gmail.com

Student Major: Computer Science

Student Degree: Undergraduate

Student Graduation: 2012

Organization: The Apache Software Foundation

Assigned Mentor:

Abstract:

The SCA Domain Manager module provides a web application that allows the domain administrator to browse through the domain and also add/remove contributions, components, and nodes. The goal of this project is to implement search functionality for this module so the administrator can easily search for the artifacts contained in a domain. Every searchable text in an SCA Domain will be extracted via introspection and indexed using a search engine, and the index will be updated every time the domain is modified. The search engine chosen is Apache Lucene, because it provides a complete set of features for text indexing and searching.

Detailed Description

The goal of this project is to implement search functionality for the SCA Domain Administration module so the domain administrator can easily search for the artifacts contained in that domain. The module already provides the ability to browse almost every artifact contained in the domain, but it does not provide a way to search for a specific artifact by name or by text contained in it. Implementing search functionality would therefore allow a user to quickly find a specific artifact in the domain.

Implementation

The search functionality will be implemented in Java, using a search engine written for that language. The implementation will be exposed as an SCA component and will be used by the SCA Domain Manager web application.

The search engine chosen is Apache Lucene, because it provides a complete set of features for text indexing, searching, and result display.

Indexing

Every searchable text in an SCA Domain will be extracted via introspection and indexed using Lucene, and the index will be updated every time the domain is modified. Lucene supports indexing per field, meaning that a text can be associated with a specific field; so, for example, the component "VegetablesCatalog" could be associated with the field "component". Field association helps when searching only for a specific type of artifact, like a component or a contribution.

Each indexed text will be indexed with the artifact it is related to. The artifact can be identified by an artifactID that is assigned to each artifact at indexing time. The artifactID could be a URI that is unique in the entire domain or a generated unique number; this choice can be discussed further with the community.
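To illustrate the idea of field-qualified terms mapped to artifact IDs, here is a minimal in-memory sketch. Lucene maintains this kind of per-field inverted index natively; the class and the example URIs below are purely illustrative, not a fixed Tuscany scheme.

```java
import java.util.*;

public class DomainIndexSketch {
    // Toy inverted index: field-qualified term -> set of artifact IDs containing it.
    private final Map<String, Set<String>> index = new HashMap<>();

    public void add(String field, String term, String artifactId) {
        index.computeIfAbsent(field + ":" + term.toLowerCase(), k -> new TreeSet<>())
             .add(artifactId);
    }

    public Set<String> search(String field, String term) {
        return index.getOrDefault(field + ":" + term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        DomainIndexSketch idx = new DomainIndexSketch();
        // Illustrative artifact IDs; the real ID scheme is to be decided with the community.
        idx.add("component", "vegetables", "store/VegetablesCatalog");
        idx.add("component", "catalog", "store/VegetablesCatalog");
        System.out.println(idx.search("component", "vegetables"));
    }
}
```

Qualifying each term with its field is what makes queries such as "component:vegetables" possible later on.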

Also, every time a new artifactID is created and indexed, it will be indexed together with the domain artifacts where it was found. For example, if the artifactID represents a component, the list of contributions that use this component is indexed with it. This is not limited to contributions: it can be anything that references the component, or that the component uses, such as an XML file or any other artifact. In addition, for every artifact the indexed artifact is related to, extra information can be attached using a Lucene feature called payloads; this information could describe the relationship between the elements. It is essentially like defining a directed graph of the SCA domain inside the Lucene index.

For text analysis, Lucene has a set of analyzers that tokenize, stem, lowercase, and so on; they basically process the text before it is indexed. Analyzers for many different languages are already provided with Lucene, but I propose to implement a different analyzer, to be used on every text defined in a domain. For example, the text "VegetablesCatalog" should preferably be tokenized into two terms, "Vegetables" and "Catalog", and indexed as separate terms rather than a single one, improving the results when the user types only "vegetables". An analyzer designed for English prose would not be able to split "VegetablesCatalog" into two words. Lowercasing should be part of this analyzer as well. So, in this project an analyzer will be implemented that identifies words separated by an uppercase letter, as this is a well-known pattern for naming elements in software. Other naming patterns may also be identified during development and implemented in this new analyzer.
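The tokenization rule described above can be sketched in a few lines of plain Java; a real Lucene analyzer would wrap the same logic in a Tokenizer/TokenFilter chain. The class name here is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class CamelCaseTokenizerSketch {
    // Split an identifier at lower-to-upper case boundaries, then lowercase each token,
    // so "VegetablesCatalog" becomes the two terms "vegetables" and "catalog".
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("(?<=[a-z0-9])(?=[A-Z])")) {
            tokens.add(part.toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("VegetablesCatalog")); // [vegetables, catalog]
    }
}
```

Applying the same rule at both index time and query time keeps the stored terms and the parsed query consistent.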

To handle different file types, file analyzers will be implemented to extract the text from them. For example, a .class file is a binary file, but its method names (mainly the ones annotated with SCA annotations) could be extracted using the Java Reflection API. File analyzers could also call other analyzers recursively; for example, a .composite file could be analyzed using a CompositeAnalyzer which, when it reaches the implementation.java node, invokes a JavaClassAnalyzer, and so on. This way each type of file will have only its significant text indexed; otherwise, if every file were parsed with a common text file analyzer, every search for "component" would match every composite file, because each one contains a "<component>" node declaration.
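As a sketch of the reflection-based extraction for .class files, the snippet below collects the names of annotated methods. The @Reference annotation declared inside it is a stand-in for the real SCA annotations, and the sample class is hypothetical.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class ClassFileAnalyzerSketch {
    // Stand-in marker for an SCA annotation; the real ones live in the SCA API packages.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Reference {}

    // Hypothetical component class used only to demonstrate the extraction.
    static class VegetablesCatalog {
        @Reference
        public void setCurrencyConverter(Object converter) {}
        public void internalHelper() {}
    }

    // Collect names of annotated methods, as a .class file analyzer might do
    // before handing them to the tokenizer for indexing.
    public static List<String> annotatedMethodNames(Class<?> cls) {
        List<String> names = new ArrayList<>();
        for (Method m : cls.getDeclaredMethods()) {
            if (m.isAnnotationPresent(Reference.class)) {
                names.add(m.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(annotatedMethodNames(VegetablesCatalog.class));
    }
}
```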

As suggested by Adriano Crestani at http://markmail.org/message/q7rdcs563okrnqfr, an analyzer should also be created to browse into compressed files and read all the files contained in them as if they were a regular folder.
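A minimal sketch of that archive analyzer, using only the JDK's zip support: it lists the entries of a jar/zip so that each entry can then be routed to the file analyzer matching its extension. The class name is hypothetical.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipAnalyzerSketch {
    // List entry names in an archive so each can be dispatched
    // to the analyzer for its file type (.composite, .class, ...).
    public static List<String> listEntries(File archive) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zf = new ZipFile(archive)) {
            for (Enumeration<? extends ZipEntry> e = zf.entries(); e.hasMoreElements();) {
                names.add(e.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway archive just to demonstrate the listing.
        File tmp = File.createTempFile("contribution", ".jar");
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(tmp))) {
            out.putNextEntry(new ZipEntry("store.composite"));
            out.closeEntry();
        }
        System.out.println(listEntries(tmp));
        tmp.delete();
    }
}
```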

If there is any timestamp defined in a domain artifact, for example the time when a contribution was added to the domain, it could also be indexed as a date. This would enable the user to filter the results by date range.
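One common way to support such range filters is to encode each date so that lexicographic term order matches chronological order; a range query then reduces to a plain string comparison over terms. This is only a sketch of the encoding idea, not Lucene's own date tools.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DateRangeSketch {
    // yyyyMMdd: lexicographic order on the encoded strings equals date order.
    private static final DateTimeFormatter FMT = DateTimeFormatter.BASIC_ISO_DATE;

    public static String encode(LocalDate date) {
        return date.format(FMT);
    }

    // A term falls in the range iff its encoded form sorts between the bounds.
    public static boolean inRange(String term, String low, String high) {
        return term.compareTo(low) >= 0 && term.compareTo(high) <= 0;
    }

    public static void main(String[] args) {
        String deployed = encode(LocalDate.of(2008, 6, 15));
        System.out.println(inRange(deployed, "20080101", "20081231")); // true
    }
}
```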

Searching

The user will enter a query in a text field available in the SCA Domain Manager web application, and the application will return the results. For now, the search text field is only available in the domain manager "home" section; I suggest extending it to the toolbar gadget, so it can be easily accessed by the user without always having to go back to the "home" section to search for something.

The query text must be parsed before it is executed, and Lucene has a complete set of classes for this job. It supports most of the query syntax used by search engines like Google. Below are some examples of queries the user could type when the Lucene query parser syntax is used:

vegetables (a single term search)

Vegetables (also a single term search, equivalent to the first example after it is lowercased by the query parser)

veg* (searches for every term that starts with "veg")

*tables (searches for every term that ends with "tables")

*egta* (searches for every term that contains "egta")

+vegetables -catalog (searches for every artifact that contains the word "vegetables", but does not contain the word "catalog")

+vegetables catalog (searches for every artifact that contains the word "vegetables" and may contain the word "catalog"; in this case results that contain both words are scored higher by the search engine)

vegetables AND catalog (searches for every element that contains both the word "vegetables" and the word "catalog"; the OR operator is also supported)

contribution:vegetables AND deployedDate:[1/1/2008 to 12/31/2008] (searches for every contribution that contains the word "vegetables" and was deployed in 2008)

These are only a few examples; the Lucene query parser supports many other syntax variations.
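The wildcard forms above can be understood as simple pattern matches over single terms. The sketch below translates the query-syntax "*" into a regular expression; Lucene implements wildcard queries against its term dictionary rather than by regex, so this is only an illustration of the semantics.

```java
import java.util.regex.Pattern;

public class WildcardSketch {
    // Translate the wildcard character '*' into regex ".*",
    // quoting every other character literally.
    public static boolean matches(String wildcard, String term) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), term);
    }

    public static void main(String[] args) {
        System.out.println(matches("veg*", "vegetables"));    // true
        System.out.println(matches("*tables", "vegetables")); // true
        System.out.println(matches("veg*", "catalog"));       // false
    }
}
```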

The Lucene query parser also supports the use of an analyzer, so the same analyzer used to process the text before indexing should be used by the query parser.

 As proposed by Adriano Crestani at http://markmail.org/message/q7rdcs563okrnqfr, the current Lucene query parser could be extended to support new queries like:

isreferenced("StoreCatalog") (returns every artifact that references the artifact "StoreCatalog")

Many other user-friendly syntaxes could be created to refine a query. Here is an example of how they could be used:

isreferenced(component:store) AND references(catalog) AND component:* (searches for every component that is referenced by a component matching "store" and that references an artifact matching "catalog")

Queries can get really complex, so I propose to create a search webpage with advanced search options that allows the user to easily build more sophisticated queries.

Displaying the Results

The results will be displayed using a tree layout, similar to how the Eclipse IDE displays its text search results, but instead of a tree like project -> package -> class -> text fragment that contains the searched text, it would be, for example, node -> contribution -> component -> .composite file -> text fragment that contains the searched text. This is just an example; how the results are displayed can still be discussed on the community mailing list.

If the result is contained inside an artifact that is a text file, then when the user clicks on the result tree leaf, the complete text file will be displayed with the matched text highlighted. If it is not a text file, for example if the result is the name of a jar file, clicking on the result tree leaf will open (download) the jar file. Also, for compressed files, an option will be offered to expand the tree down into the file and browse through its contents.

When displaying a result, if a file reference is in it, it could be displayed as a link to the file, as suggested by Luciano Resende at http://markmail.org/message/ugldeod4u54nknz5.
By default, the results returned by Lucene are sorted by a score that is assigned to each result using a predefined heuristic. This is considered one of the best ways to present results, because it takes word frequency in the index and other variables into account to decide whether a result is relevant. However, how the results are sorted and scored can also be discussed on the community mailing list.

Integration

As already mentioned, the indexing/searching functionality will be exposed as an SCA component. The SCA Domain Manager will use this component to index and search everything in the domain. A ContributionListener will be used to listen for every change in the domain, i.e. when a contribution is added, updated, or removed, and to update the index to reflect the modification.
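The listener wiring can be sketched as below. The interface shape shown is hypothetical; Tuscany's actual ContributionListener may declare different method signatures, and the "index" here is just a set of contribution URIs standing in for the Lucene index.

```java
import java.util.Set;
import java.util.TreeSet;

public class ContributionListenerSketch {
    // Hypothetical listener shape; the real Tuscany interface may differ.
    interface ContributionListener {
        void contributionAdded(String uri);
        void contributionRemoved(String uri);
    }

    // Keeps a toy "index" in sync with the domain: in the real implementation
    // these callbacks would add documents to / delete documents from Lucene.
    static class IndexUpdater implements ContributionListener {
        final Set<String> indexedContributions = new TreeSet<>();

        public void contributionAdded(String uri)   { indexedContributions.add(uri); }
        public void contributionRemoved(String uri) { indexedContributions.remove(uri); }
    }

    public static void main(String[] args) {
        IndexUpdater updater = new IndexUpdater();
        updater.contributionAdded("store-contribution");
        System.out.println(updater.indexedContributions);
        updater.contributionRemoved("store-contribution");
        System.out.println(updater.indexedContributions);
    }
}
```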

The index will be opened when the SCA Domain Manager is initialized, and if it does not exist it will be created from every artifact already contained in the domain. An option should be available in the UI to reindex the whole domain; this may be useful when the domain is changed while the SCA Domain Manager is not running and thus unable to ask the search component to update the index.

Additional Information

I intend to work on this project 5 days/week, at least 6 hours a day.

Schedule

April 20 - 27: getting familiar with the Tuscany SCA code and setting up the environment

April 28 - May 26: Implementing text and file analyzer for indexing

May 27 - June 10: Implementing indexing phase

June 11 - June 18: Implementing searching phase

June 19 - July 3: Implementing results tree

July 4 - July 18: Implementing full result display

July 13: Mid-term evaluation

July 29 - August 5: Implementing results highlighting

August 6 - August 10: Performance adjustments

August 11 - 16: Documenting and writing test cases

August 17: Final evaluation
