Background

A Provider is a class that collects a particular desired dataset from a data source.

Examples

Official Providers reside in the streams-contrib directory of https://github.com/apache/incubator-streams and are named streams-provider-*

https://git-wip-us.apache.org/repos/asf?p=incubator-streams.git;a=tree;f=streams-contrib;h=0dc5e92783e9aa634b29e11f0bbd4e97fd17d8c0;hb=HEAD

Steps

Pick a source

Decide where the data will come from. Find the source's API documentation and confirm that the entities, events, and relationships it exposes align with the Activity Streams data model.

Figure out credentials

If the data source requires credentials, go through the process of getting credentials and document how you got them.

 

Prioritize what to collect

Go through the API documentation carefully, and make a list of the types of:

  • Entities - people, places, things
  • Events - stuff that happened or is planned to happen, involving one or more entities, with a timestamp
  • Relationships - how entities are connected, timestamp optional

Identify which have a direct and obvious alignment with Activity Streams, and note the corresponding objectType (for entities) and verb (for events / relationships) from the Activity Streams vocabulary.
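One way to record this alignment is a small lookup table kept next to your notes. The sketch below is only an illustration: the source-side names (account, status, subscription) are invented, the class SourceAlignment is hypothetical, and the objectType and verb values should be checked against the Activity Streams vocabulary before you rely on them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical alignment table from source-side type names to Activity Streams terms. */
class SourceAlignment {

    /** Entities map to an objectType; events and relationships map to a verb. */
    enum Kind { OBJECT_TYPE, VERB }

    /** One row of the table: which kind of term it is, and its value. */
    static final class Target {
        final Kind kind;
        final String value;

        Target(Kind kind, String value) {
            this.kind = kind;
            this.value = value;
        }
    }

    private final Map<String, Target> table = new LinkedHashMap<>();

    /** Record that a source entity type aligns with an objectType. */
    SourceAlignment entity(String sourceType, String objectType) {
        table.put(sourceType, new Target(Kind.OBJECT_TYPE, objectType));
        return this;
    }

    /** Record that a source event or relationship type aligns with a verb. */
    SourceAlignment activity(String sourceType, String verb) {
        table.put(sourceType, new Target(Kind.VERB, verb));
        return this;
    }

    /** Returns null when the source type has no recorded alignment. */
    Target lookup(String sourceType) {
        return table.get(sourceType);
    }

    /** Example rows; the left-hand names are invented for illustration. */
    static SourceAlignment example() {
        return new SourceAlignment()
            .entity("account", "page")
            .activity("status", "post")
            .activity("subscription", "follow");
    }
}
```

Source types for which lookup returns null have no direct and obvious alignment and are candidates to leave off the short list.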

Make a short list of what to collect, taking into consideration the domains already covered by the project.  It will probably look like:

  1. Page 
  2. Post
  3. Follow
  4. ...

 

Document up-stream data model

For each entity, event, and relationship on the short-list, find links to documentation on what the data looks like in JSON or XML form.

Look for a high-quality Java library

Search GitHub, Stack Overflow, and Google to see whether a high-quality Java library exists that would simplify the code involved in getting the data.

It should be:

  1. Released under a FOSS-friendly license (Apache 2.0, MIT, etc.)
  2. Available in Maven Central (findable with search.maven.org)
  3. Active (at least one release in the prior 12 months)

Make notes on how you plan to use the Java library to get the source data on your short list.

 

Figure out permissions

If the data source requires special permissions to get at the dataset you are looking at, figure out how to get those permissions and document the process.

Document all the information that will be needed to connect to the data source.

Make an empty module

Create an empty module in your own project or in streams-contrib.  Make sure it's part of the reactor.

Create a base configuration object

Create a JSON schema file (under src/main/jsonschema) with fields containing all the information needed to establish basic connectivity with the data source.

These fields should include:

  • everything needed to connect
  • everything needed to authenticate

Example:

https://git-wip-us.apache.org/repos/asf?p=incubator-streams.git;a=blob;f=streams-contrib/streams-provider-twitter/src/main/jsonschema/com/twitter/TwitterConfiguration.json;h=69048d123022a2e138932c8a14ef9e846438bc41;hb=HEAD

Create a reference.conf

Create a reference.conf file in src/main/resources containing a HOCON snippet that matches the base configuration schema.

This file should contain only the connection details, no credentials.

Putting these values in reference.conf ensures that they are set by default for anyone who uses the module, which relieves you of needing to bake default values into either the code or the JSON schemas.

Example:

https://git-wip-us.apache.org/repos/asf?p=incubator-streams.git;a=blob;f=streams-contrib/streams-provider-twitter/src/main/resources/reference.conf;h=b5c9f6f1f58ccc4c3c56e9b18dbddf42aa2d3192;hb=HEAD

Create a credential resource file for testing

Create an application.conf file containing a HOCON snippet that matches the base configuration schema and holds your credentials.

This file should contain only your credentials, but a single file can hold the credentials for every provider you are working with.

Example:
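The split between default connection details and separately supplied credentials can be illustrated without any HOCON library. The sketch below uses java.util.Properties as a stand-in for the two files, which is an assumption made to keep the example dependency-free; it is not how the project loads configuration. The class LayeredConfigSketch and all keys and values are hypothetical.

```java
import java.util.Properties;

/** Dependency-free illustration of defaults (reference.conf) overlaid by credentials (application.conf). */
class LayeredConfigSketch {

    /** Stand-in for reference.conf: connection details only, no credentials. */
    static Properties referenceDefaults() {
        Properties defaults = new Properties();
        defaults.setProperty("example.protocol", "https");
        defaults.setProperty("example.host", "api.example.com");
        defaults.setProperty("example.port", "443");
        return defaults;
    }

    /** Overlay the supplied values on the defaults; lookups fall back to the defaults when a key is absent. */
    static Properties withOverrides(Properties defaults, Properties overrides) {
        Properties merged = new Properties(defaults);
        for (String name : overrides.stringPropertyNames()) {
            merged.setProperty(name, overrides.getProperty(name));
        }
        return merged;
    }
}
```

A caller would build a second Properties object holding only credential keys and pass it as the overrides; the connection details then come from the defaults without being repeated.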

Create a unit test that demonstrates reading the test configuration resource into the configuration object

The test should demonstrate that the test resource is loaded from the HOCON snippet into the JVM properties and then, using StreamsConfigurator, into an instance of the base configuration object.

Example:

TODO

Create a base provider that just opens a re-usable connection object to the data source

Create a Java class which implements StreamsProvider.

This provider doesn't need to implement any of the read* methods, just prepare.  Calling prepare should result in a Provider with a live connection to the data source.  

Appropriate validation on the configuration and on the resulting connection object should be added.

Example:

TODO
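A minimal sketch of the shape described above, under stated assumptions: it does not implement the real StreamsProvider interface, and ConnectionSettings, SourceConnection, and BaseProviderSketch are hypothetical stand-ins. It shows only the idea that prepare validates the configuration, opens the connection object, and validates the result.

```java
/** Hypothetical connection settings; a real module would generate this bean from its JSON schema. */
final class ConnectionSettings {
    final String endpoint;
    final String accessToken;

    ConnectionSettings(String endpoint, String accessToken) {
        this.endpoint = endpoint;
        this.accessToken = accessToken;
    }
}

/** Hypothetical stand-in for the client object supplied by the upstream library. */
final class SourceConnection {
    private final String endpoint;
    private boolean open = true;

    SourceConnection(String endpoint) {
        this.endpoint = endpoint;
    }

    String getEndpoint() { return endpoint; }
    boolean isOpen() { return open; }
    void close() { open = false; }
}

/** Base provider sketch: prepare() opens a re-usable connection; no read methods are implemented. */
class BaseProviderSketch {
    private final ConnectionSettings settings;
    protected SourceConnection connection;

    BaseProviderSketch(ConnectionSettings settings) {
        this.settings = settings;
    }

    /** Validate the configuration, open the connection, then validate the connection. */
    public void prepare() {
        if (settings == null) {
            throw new IllegalArgumentException("configuration is required");
        }
        if (settings.endpoint == null || settings.endpoint.isEmpty()) {
            throw new IllegalArgumentException("endpoint is required");
        }
        if (settings.accessToken == null || settings.accessToken.isEmpty()) {
            throw new IllegalArgumentException("accessToken is required");
        }
        // A real provider would perform the network handshake here.
        connection = new SourceConnection(settings.endpoint);
        if (!connection.isOpen()) {
            throw new IllegalStateException("connection could not be opened");
        }
    }

    boolean isConnected() {
        return connection != null && connection.isOpen();
    }
}
```

Subclasses can then reuse the protected connection field when they add read methods.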

Create an integration test that demonstrates connectivity

Make an 'IT' (integration test) class in src/test/java that loads the test configuration with your credentials in it, instantiates a provider, and then asserts that the connection object is connected and authorized.

Example:

TODO

Create a specialized provider configuration for the profile provider

Create a second configuration bean that extends the base configuration bean we created earlier, but also has fields that specify what data should be collected.


Example:
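In plain Java terms the relationship is simple inheritance. The sketch below is hand-written for illustration; in the project these beans would normally be generated from JSON schema files, and the class and field names used here (BaseSourceConfiguration, ProfileProviderConfiguration, ids) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical base bean: connection and authentication fields only. */
class BaseSourceConfiguration {
    private String endpoint;
    private String accessToken;

    String getEndpoint() { return endpoint; }
    void setEndpoint(String endpoint) { this.endpoint = endpoint; }

    String getAccessToken() { return accessToken; }
    void setAccessToken(String accessToken) { this.accessToken = accessToken; }
}

/** Hypothetical specialized bean: inherits connectivity and adds what to collect. */
class ProfileProviderConfiguration extends BaseSourceConfiguration {
    private final List<String> ids = new ArrayList<>();

    /** IDs of the profiles to collect, exposed read-only. */
    List<String> getIds() {
        return Collections.unmodifiableList(ids);
    }

    /** Add one profile ID and return this bean for chaining. */
    ProfileProviderConfiguration withId(String id) {
        ids.add(id);
        return this;
    }
}
```

Because the specialized bean is also a base bean, it can be handed to code that only needs the connection details.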

Create initial provider class for collecting profiles

Create a provider class that extends the base provider (which connects but doesn't implement the read methods).

The first provider will take a list of IDs and get the current profile for each.

Implement the startStream method on this provider.  startStream should create and queue threads to bring data into the class.

Implement the readCurrent method on this provider.  readCurrent should pass collected data in a StreamsResultSet to the caller exactly once.

Example:

https://git-wip-us.apache.org/repos/asf?p=incubator-streams.git;a=blob;f=streams-contrib/streams-provider-twitter/src/main/java/org/apache/streams/twitter/provider/TwitterUserInformationProvider.java;h=214d2049edd307a059a7c8800faeb6868d866d60;hb=HEAD
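The division of labour between startStream and readCurrent can also be sketched with only the JDK. This is not the project's implementation: ProfileCollectorSketch is hypothetical, a plain List of strings stands in for StreamsResultSet, and the fetch function stands in for a call to the upstream library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Sketch: worker threads buffer profiles; readCurrent hands each one to the caller once. */
class ProfileCollectorSketch {
    private final List<String> ids;
    private final Function<String, String> fetchProfile;
    private final Queue<String> collected = new ConcurrentLinkedQueue<>();
    private final ExecutorService executor = Executors.newFixedThreadPool(2);

    ProfileCollectorSketch(List<String> ids, Function<String, String> fetchProfile) {
        this.ids = ids;
        this.fetchProfile = fetchProfile;
    }

    /** Queue one task per ID; each task fetches a profile and adds it to the buffer. */
    void startStream() {
        for (String id : ids) {
            executor.execute(() -> collected.add(fetchProfile.apply(id)));
        }
        executor.shutdown(); // accept no further tasks; queued tasks still run
    }

    /** True while queued tasks have not all finished. */
    boolean isRunning() {
        return !executor.isTerminated();
    }

    /** Wait for the queued tasks to finish; returns false on timeout or interruption. */
    boolean awaitCompletion(long timeout, TimeUnit unit) {
        try {
            return executor.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Drain the buffer, so an item returned here is never returned again. */
    List<String> readCurrent() {
        List<String> batch = new ArrayList<>();
        String item;
        while ((item = collected.poll()) != null) {
            batch.add(item);
        }
        return batch;
    }
}
```

Draining with poll is what gives the exactly-once behaviour: a second call to readCurrent returns only items that arrived after the first call.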

Identify and understand the upstream Java library profile object

If using a Java library, find the object in the SDK that corresponds to the profile object.

Figure out if

Add a main method to the provider

This should allow you to run the provider from the command line, with all the collected data written into a specified file.
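A rough sketch of such an entry point follows. It assumes a single command-line argument naming the output file and writes one record per line; ProviderCliSketch and its collect method are hypothetical, and collect returns canned records where a real main method would run the provider.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Sketch of a command-line entry point that writes collected records to a file. */
class ProviderCliSketch {

    /** Stand-in for running the provider; returns canned records for illustration. */
    static List<String> collect() {
        return Arrays.asList("{\"id\":\"1\"}", "{\"id\":\"2\"}");
    }

    /** Write one record per line to the file named by args[0]; returns the number written. */
    static int run(String[] args) {
        if (args.length != 1) {
            throw new IllegalArgumentException("usage: ProviderCliSketch <outfile>");
        }
        Path outfile = Paths.get(args[0]);
        List<String> records = collect();
        try {
            Files.write(outfile, records, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records.size();
    }

    public static void main(String[] args) {
        int written = run(args);
        System.err.println("wrote " + written + " records to " + args[0]);
    }
}
```

Keeping the work in run and leaving main as a thin wrapper makes it easy for a test to call the same code path and inspect the file afterwards.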

Create an integration test for the provider that calls the main method

This simultaneously tests that data can be collected and that the Java CLI binding works.

As a side effect, collected data gets placed in target/test-classes and can then be used to test conversion.

Example:

https://git-wip-us.apache.org/repos/asf?p=incubator-streams.git;a=blob;f=streams-contrib/streams-provider-twitter/src/test/java/org/apache/streams/twitter/test/providers/TwitterUserInformationProviderIT.java;h=f3c31973ee82d3b0182b172058865e43b93dc7bd;hb=HEAD

Implement a 

 
