...

To handle different file types, file analyzers will be implemented to extract the relevant text from each file. For example, a .class file is a binary file, but its method names (mainly the ones annotated with SCA annotations) could be extracted using the Java Reflection API. File analyzers could also call other analyzers recursively: a .composite file could be analyzed by a CompositeAnalyzer which, on reaching an implementation.java node, invokes a JavaClassAnalyzer, and so on. This way only the significant text of each file type is indexed. Otherwise, if every file were parsed with a common text file analyzer, a search for "component" would match every composite file, because each one contains a "<component>" node declaration.
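The recursive analyzer idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the FileAnalyzer interface, its extractTerms method, and the CatalogImpl class are hypothetical names, and the composite is scanned with simple regular expressions instead of a real XML parser.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical analyzer contract: return only the text worth indexing.
interface FileAnalyzer {
    List<String> extractTerms(String input);
}

// Extracts method names through reflection; the input is a class name.
class JavaClassAnalyzer implements FileAnalyzer {
    public List<String> extractTerms(String className) {
        List<String> terms = new ArrayList<>();
        try {
            for (Method method : Class.forName(className).getDeclaredMethods()) {
                if (!method.isSynthetic()) {
                    terms.add(method.getName());
                }
            }
        } catch (ClassNotFoundException e) {
            // Class is not on the classpath: nothing to index for it.
        }
        return terms;
    }
}

// Extracts component names from composite text and delegates every
// implementation.java class to the JavaClassAnalyzer (the recursive step).
class CompositeAnalyzer implements FileAnalyzer {
    private static final Pattern COMPONENT =
            Pattern.compile("<component\\s+name=\"([^\"]+)\"");
    private static final Pattern JAVA_IMPL =
            Pattern.compile("<implementation\\.java\\s+class=\"([^\"]+)\"");
    private final FileAnalyzer javaAnalyzer = new JavaClassAnalyzer();

    public List<String> extractTerms(String compositeXml) {
        List<String> terms = new ArrayList<>();
        Matcher components = COMPONENT.matcher(compositeXml);
        while (components.find()) {
            terms.add(components.group(1));
        }
        Matcher implementations = JAVA_IMPL.matcher(compositeXml);
        while (implementations.find()) {
            terms.addAll(javaAnalyzer.extractTerms(implementations.group(1)));
        }
        return terms;
    }
}

// Stand-in implementation class used only by this example.
class CatalogImpl {
    public String listItems() {
        return "apple";
    }
}
```

With this sketch, the markup word "component" never reaches the extracted terms; only the component name and the method names of its implementation class do.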

As suggested by Adriano Crestani at http://markmail.org/message/q7rdcs563okrnqfr, an analyzer should also be created to browse into compressed files and read all the files they contain, as if each archive were a regular folder.
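A minimal sketch of such an archive analyzer, using only java.util.zip, might look like the following. The ArchiveWalker class, its method names, and the "!/" separator used to mark paths inside a nested archive are assumptions made for illustration; entries are assumed to be UTF-8 text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ArchiveWalker {
    // Reads every file entry of a zip/jar into a path -> text map,
    // descending into nested .jar/.zip entries as if they were folders.
    static Map<String, String> readEntries(byte[] archive, String prefix) {
        Map<String, String> files = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                byte[] data = readCurrentEntry(zip);
                String path = prefix + entry.getName();
                if (path.endsWith(".jar") || path.endsWith(".zip")) {
                    files.putAll(readEntries(data, path + "!/"));
                } else {
                    files.put(path, new String(data, StandardCharsets.UTF_8));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    // Reads the bytes of the entry the stream is currently positioned on.
    private static byte[] readCurrentEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int count;
        while ((count = in.read(buffer)) != -1) {
            out.write(buffer, 0, count);
        }
        return out.toByteArray();
    }

    // Example helper: builds a one-entry zip in memory.
    static byte[] zipOf(String name, byte[] content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(content);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Each returned path could then be handed to the analyzer that matches its extension, exactly as for a file found in a regular folder.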

If there is any timestamp defined in a domain artifact, for example, when a contribution was added to the domain, it could also be indexed as a date range. This would enable the user to filter the results by date range.
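As a rough illustration of date-range filtering, the sketch below keeps only the contributions whose deployment date falls inside an inclusive range. The DeployedDateFilter class and the map of contribution names to dates are hypothetical; in the actual proposal this filtering would be done by the index, not by an in-memory loop.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class DeployedDateFilter {
    // Returns the names of contributions deployed within [from, to], sorted by name.
    static List<String> deployedBetween(Map<String, LocalDate> deployedDates,
                                        LocalDate from, LocalDate to) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, LocalDate> entry : deployedDates.entrySet()) {
            LocalDate deployed = entry.getValue();
            // Inclusive on both ends: not before the start, not after the end.
            if (!deployed.isBefore(from) && !deployed.isAfter(to)) {
                matches.add(entry.getKey());
            }
        }
        Collections.sort(matches);
        return matches;
    }
}
```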

...

vegetables AND catalog (searches for every element that contains the words "vegetables" and "catalog"; the OR operation is also supported)

contribution:vegetables AND deployedDate:[1/1/2008 TO 12/31/2008] (searches for every contribution that contains the word "vegetables" and was deployed in 2008)

These are only some examples; many other syntaxes are supported by the Lucene query parser.

The Lucene query parser also accepts an analyzer, so the same analyzer used to parse the text before indexing should be used by the query parser.

 As proposed by Adriano Crestani at http://markmail.org/message/q7rdcs563okrnqfr, the current Lucene query parser could be extended to support new queries like:

isreferenced("StoreCatalog") (returns every artifact that references the artifact "StoreCatalog")

Many other user-friendly syntaxes could be created to refine a query. Here is an example of how they could be used:

 isreferenced(component:store) AND references(catalog) AND component:* (search for every component that is referenced by any component that starts with "store" and references any artifact that ends with "catalog")
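Reference-based queries of this kind need a lookup in both directions of the reference relation. The sketch below is a hypothetical in-memory ReferenceIndex, not an extension of the Lucene query parser; the method names referrersOf and referencedBy are deliberately neutral, because how the proposed isreferenced(...) and references(...) operators map onto each direction is left to the final syntax definition.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ReferenceIndex {
    private final Map<String, Set<String>> outgoing = new HashMap<>();
    private final Map<String, Set<String>> incoming = new HashMap<>();

    // Records that artifact "from" references artifact "to".
    void addReference(String from, String to) {
        outgoing.computeIfAbsent(from, key -> new TreeSet<>()).add(to);
        incoming.computeIfAbsent(to, key -> new TreeSet<>()).add(from);
    }

    // Every artifact that references the given target.
    Set<String> referrersOf(String target) {
        return incoming.getOrDefault(target, Collections.emptySet());
    }

    // Every artifact that the given source references.
    Set<String> referencedBy(String source) {
        return outgoing.getOrDefault(source, Collections.emptySet());
    }

    // AND of two result sets, as used when combining query clauses.
    static Set<String> and(Set<String> left, Set<String> right) {
        Set<String> result = new TreeSet<>(left);
        result.retainAll(right);
        return result;
    }
}
```

For example, after recording that "Store" and "Admin" both reference "StoreCatalog", referrersOf("StoreCatalog") returns both of them, and and(...) narrows that set against the result of another clause.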

 Queries can get really complex, so I propose to create a search webpage with advanced search options that would allow the user to easily create more sophisticated queries.

Displaying the Results

...

The results will be displayed using a tree layout, similar to what the Eclipse IDE does for its text search results (see image below). Instead of a tree like project -> package -> class -> text fragment that contains the searched text, it would be, for example, node -> contribution -> component -> .composite file -> text fragment that contains the searched text. This is just an example; the way the results are displayed can still be discussed on the community mailing list.

[Image: eclipse-search-example.png]
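Grouping flat search hits into such a tree can be sketched as below. The ResultTree class and the path segments used in the example are hypothetical; the rendering is plain indented text standing in for whatever widget the web page ends up using.

```java
import java.util.Map;
import java.util.TreeMap;

class ResultTree {
    // Children keyed by label; TreeMap keeps siblings in alphabetical order.
    private final Map<String, ResultTree> children = new TreeMap<>();

    // Adds one hit, given as the path from the root down to the text fragment.
    // Hits sharing a prefix (same node, contribution, ...) share tree nodes.
    void addHit(String... path) {
        ResultTree node = this;
        for (String segment : path) {
            node = node.children.computeIfAbsent(segment, key -> new ResultTree());
        }
    }

    // Renders the tree with two spaces of indentation per level.
    String render() {
        StringBuilder out = new StringBuilder();
        render(out, 0);
        return out.toString();
    }

    private void render(StringBuilder out, int depth) {
        for (Map.Entry<String, ResultTree> child : children.entrySet()) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(child.getKey()).append('\n');
            child.getValue().render(out, depth + 1);
        }
    }
}
```

Two hits in different components of the same contribution would then share the node and contribution levels and branch only at the component level.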

 
If a result is contained inside an artifact that is a text file, then when the user clicks on the result tree leaf, the complete text file will be displayed with the matching text highlighted. If it is not a text file, for example if the result is the name of a jar file, clicking on the result tree leaf will open (download) the jar file. For compressed files, there should also be an option to expand the tree down into the archive and browse through the files it contains.

When displaying a result, if a file reference is in it, it could be displayed as a link to the file, as suggested by Luciano Resende at http://markmail.org/message/ugldeod4u54nknz5.

By default, the results returned by Lucene are sorted by a score that is assigned to each result using a predefined heuristic. This is considered to be one of the best ways to show the results, because it takes into account the word frequency in the index and some other variables to decide whether a result is relevant. However, the way the results are sorted and scored can also be discussed on the community mailing list.

Integration

As already mentioned, the indexing/searching functionality will be exposed as an SCA component. The SCA Domain Manager will use this component to index and search everything in the domain. A ContributionListener will be used to listen for every change in the domain (a contribution being added, updated, or removed) and update the index to reflect the modification.

The index will be opened when the SCA Domain Manager is initialized; if it does not exist, it will be created from every artifact already contained in the domain. An option should be available in the UI to reindex the whole domain. This may be useful when the domain changes while the SCA Domain Manager is not running and therefore cannot ask the search component to update the index.
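The listener-driven update and the full reindex described above could be sketched as follows. SimpleContributionListener is a hypothetical, simplified interface written for this example and does not reproduce the signatures of Tuscany's actual ContributionListener; the index is a plain in-memory map standing in for the Lucene index.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical, simplified listener contract for domain changes.
interface SimpleContributionListener {
    void contributionAdded(String uri, String text);
    void contributionUpdated(String uri, String text);
    void contributionRemoved(String uri);
}

// Keeps a toy index (contribution URI -> lower-cased text) in sync with the domain.
class IndexUpdater implements SimpleContributionListener {
    private final Map<String, String> indexedText = new TreeMap<>();

    public void contributionAdded(String uri, String text) {
        indexedText.put(uri, text.toLowerCase(Locale.ROOT));
    }

    public void contributionUpdated(String uri, String text) {
        // An update is modeled as removing the stale entry and adding the new one.
        contributionRemoved(uri);
        contributionAdded(uri, text);
    }

    public void contributionRemoved(String uri) {
        indexedText.remove(uri);
    }

    // Rebuilds the index from scratch, e.g. when no index exists yet or
    // the domain changed while the manager was not running.
    void reindexAll(Map<String, String> domainContents) {
        indexedText.clear();
        for (Map.Entry<String, String> entry : domainContents.entrySet()) {
            contributionAdded(entry.getKey(), entry.getValue());
        }
    }

    // Returns the URIs of contributions whose text contains the given word.
    Set<String> search(String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, String> entry : indexedText.entrySet()) {
            if (Arrays.asList(entry.getValue().split("\\W+")).contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

The reindexAll method covers both the first-start case and the manual "reindex" option in the UI, since both amount to rebuilding the index from the current domain contents.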

Additional Information

I intend to work on this project 5 days/week, at least 6 hours a day.

...