Concepts

A Provider Module is a class module that collects a particular desired standards-compliant dataset from a non-standards-compliant data source.

A Provider Module contains provider document schemas, provider configuration schemas, provider classes, converter classes, and testing that ensures provider and converter classes continue to work as expected over time.

A Provider Document Schema is an artifact that declares the significant fields of a particular entity, event, or relationship in its original form.

A Provider Configuration Schema is an artifact that describes how to query a particular endpoint for data of a particular type.

A Provider Class is software that can produce documents matching a Provider Document Schema, given an instance of its Provider Configuration Schema.

A Provider Converter is software that converts provider documents into Activity Streams documents.
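To make the converter concept concrete, here is a minimal sketch that turns a parsed GitHub user document into an Activity Streams style actor. It assumes the document has already been parsed into a Map. The class name GithubUserConverter, the "id:github:" identifier prefix, and the sample values are hypothetical; a real converter would use the Apache Streams converter interfaces rather than plain maps.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: converts a GitHub user document into an Activity Streams style actor. */
class GithubUserConverter {

    /**
     * Copies the significant fields of the upstream document into Activity Streams fields.
     * The "id:github:" prefix is an assumption made for this sketch.
     */
    static Map<String, Object> toActor(Map<String, Object> githubUser) {
        Object name = githubUser.get("name");
        Map<String, Object> actor = new LinkedHashMap<>();
        actor.put("id", "id:github:" + githubUser.get("id"));
        actor.put("objectType", "person");
        // Fall back to the login when the user has no display name set.
        actor.put("displayName", name != null ? name : githubUser.get("login"));
        actor.put("url", githubUser.get("html_url"));
        return actor;
    }

    public static void main(String[] args) {
        // Sample values for illustration, not real API output.
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("id", 1);
        user.put("login", "octocat");
        user.put("html_url", "https://github.com/octocat");
        System.out.println(toActor(user));
    }
}
```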

Goal

The goal is to write a Maven module that supports collection and conversion of a new online data source into Activity Streams format using Apache Streams interfaces and patterns.

...

https://git-wip-us.apache.org/repos/asf?p=incubator-streams.git;a=tree;f=streams-contrib;h=0dc5e92783e9aa634b29e11f0bbd4e97fd17d8c0;hb=HEAD

Steps

Pick a source

Decide where the data will come from.  

As an example, this guide walks through building a provider for github.com.

Collect links to pertinent documentation and resources.

Find their API documentation and confirm that the entities, events, and relationships available align with the Activity Streams data model.

REST API documentation: https://developer.github.com/v3/

Figure out credentials

If the data source requires credentials, go through the process of getting credentials and document how you got them.

Create personal token: https://github.com/settings/tokens

Identify Important Entities

Go through the API documentation carefully, and make a list of the most important types of Entities, as well as at least one way those entities can be collected.

...

Identify Important Event Types

Go through the API documentation carefully, and make a list of the most important types of Events that involve the important Entity types, as well as at least one way those events can be collected.

Events are specific occurrences involving one or more entities, almost always with a timestamp.

List of 30 types - https://developer.github.com/v3/activity/events/types/

Top Four Event Types:

...

Document up-stream data model

For each entity, event, and relationship on the short-list, find links to documentation on exactly what the data looks like in JSON or XML form.

Identify Important Relationships

Go through the API documentation carefully, and make a list of the most important types of Relationships that associate the important Entity types, as well as at least one way those relationships can be collected.

Top Four Relationship Types:

...

Find the best Activity Streams Actor or Object type for each upstream Entity type 

Identify which upstream Entity types have a direct and obvious alignment with Activity Streams, and note the corresponding Actor or Object type from the Activity Streams vocabulary.

Make a short list of what to collect, taking into consideration the domains already covered by the project.  It will probably look like:

  1. Page 
  2. Post
  3. Follow
  4. ...

 

github:user -> as:Person

github:organization -> as:Organization

github:repository -> as:Page

Some of these will be obvious, some will be debatable, and some simply won't match the Activity Streams vocabulary at all.  That's OK.
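It can help to record the agreed mappings in one place so that converters and tests share them. The following is a small sketch of such a lookup table, using the three entity mappings listed above. The class and method names are hypothetical, and "gist" is included only as an example of an upstream type with no recorded alignment.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Sketch only: lookup table from upstream GitHub entity types to Activity Streams types. */
class GithubEntityMapping {

    private static final Map<String, String> ENTITY_TYPES = new LinkedHashMap<>();

    static {
        ENTITY_TYPES.put("user", "Person");
        ENTITY_TYPES.put("organization", "Organization");
        ENTITY_TYPES.put("repository", "Page");
    }

    /** Returns the Activity Streams type, or empty when no alignment has been recorded. */
    static Optional<String> toActivityStreamsType(String githubType) {
        return Optional.ofNullable(ENTITY_TYPES.get(githubType));
    }

    public static void main(String[] args) {
        for (String type : new String[] {"user", "organization", "repository", "gist"}) {
            String mapped = toActivityStreamsType(type).map(t -> "as:" + t).orElse("(no match)");
            System.out.println("github:" + type + " -> " + mapped);
        }
    }
}
```

Returning an Optional rather than null keeps the "no match" case explicit, which fits the point that some upstream types will not align with the vocabulary.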

Actor / Object Types produced by official provider modules include:

  • Person / Profile / Page 
  • Organization / Group

Find the best Activity Streams Activity or Relationship type for each upstream Event type 

Identify which upstream Event types have a direct and obvious alignment with Activity Streams, and note the corresponding Activity and/or Relationship type from the Activity Streams vocabulary.

Activity Types produced by official provider modules include:

  • Post
  • Share

Find the best Activity Streams Activity or Relationship type for each upstream Relationship type 

Identify which upstream Relationship types have a direct and obvious alignment with Activity Streams, and note the corresponding Activity and/or Relationship type from the Activity Streams vocabulary.

github:follow -> as:IsFollowing, as:IsFollowedBy

github:member -> as:IsMember

Relationship Types produced by official provider modules include:

  • Follow / Friend

Enumerate and Prioritize Providers

Make a short list of providers to write in the initial implementation, identifying what inputs they will require to start and what type(s) of documents they will provide.

Look for a high-quality Java library

Search GitHub, Stack Overflow, and Google to see if you can find a high-quality Java library that simplifies the code involved in getting the data.

...

Make notes on how you plan to use the Java library to get the source data on your short-list.

 

Figure out permissions

If the data source requires special permissions to get at the dataset you are looking at, figure out how to get those permissions and document the process.

Document all the information that will be needed to connect to the data source.

Make an empty module

Create an empty module in your own project or in streams-contrib.  Make sure it's part of the reactor.

Create a base configuration object

Create a JSON schema file (in src/main/jsonschema) with fields containing all the information needed to establish basic connectivity with the data source.

...

https://git-wip-us.apache.org/repos/asf?p=incubator-streams.git;a=blob;f=streams-contrib/streams-provider-twitter/src/main/jsonschema/com/twitter/TwitterConfiguration.json;h=69048d123022a2e138932c8a14ef9e846438bc41;hb=HEAD
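For orientation, the sketch below shows a hand-written Java equivalent of the kind of bean such a schema might produce for GitHub. It assumes the schema declares just two fields, endpoint and oauthToken; both field names, the class name GithubConfiguration, and the isValid helper are hypothetical and are not taken from the project.

```java
/** Sketch only: hand-written stand-in for a bean described by a base configuration schema. */
class GithubConfiguration {

    // Assumed default: the public GitHub REST API root.
    private String endpoint = "https://api.github.com";

    // Assumed field name for the personal token.
    private String oauthToken;

    String getEndpoint() {
        return endpoint;
    }

    String getOauthToken() {
        return oauthToken;
    }

    GithubConfiguration withEndpoint(String endpoint) {
        this.endpoint = endpoint;
        return this;
    }

    GithubConfiguration withOauthToken(String oauthToken) {
        this.oauthToken = oauthToken;
        return this;
    }

    /** True when everything needed for basic connectivity is present. */
    boolean isValid() {
        return endpoint != null && !endpoint.trim().isEmpty()
            && oauthToken != null && !oauthToken.trim().isEmpty();
    }

    public static void main(String[] args) {
        GithubConfiguration config = new GithubConfiguration().withOauthToken("<personal-token>");
        System.out.println(config.getEndpoint() + " valid=" + config.isValid());
    }
}
```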

Create a reference.conf

Create a reference.conf file in src/main/resources containing a HOCON snippet that matches the base configuration schema and holds just the connection details.

...

https://git-wip-us.apache.org/repos/asf?p=incubator-streams.git;a=blob;f=streams-contrib/streams-provider-twitter/src/main/resources/reference.conf;h=b5c9f6f1f58ccc4c3c56e9b18dbddf42aa2d3192;hb=HEAD

Create a credential resource file for testing

Create an application.conf file containing a HOCON snippet that matches the base configuration schema and holds your credentials.

...

Create a unit test that demonstrates reading the test configuration resource into the configuration object

The test should demonstrate that the test resource is loaded from the HOCON snippet into the JVM properties, and then, using StreamsConfigurator, into an instance of the base configuration object.

Example:

TODO

Create a base provider that just opens a re-usable connection object to the data source

Create a Java class which implements StreamsProvider.

...

Appropriate validation on the configuration and on the resulting connection object should be added.

Example:

TODO
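A minimal sketch of such a base provider, using only the JDK so that it runs without the Streams dependencies. In a real module the class would implement StreamsProvider; that interface is deliberately left out here. The class name, the method names prepare, isPrepared, and cleanUp, and the use of java.net.http.HttpClient as the re-usable connection object are all assumptions made for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.time.Duration;

/** Sketch only: base provider that validates its configuration and opens a re-usable client. */
class GithubProviderSketch {

    // Protected so that specialized providers extending this class can build requests.
    protected final URI endpoint;
    protected final String oauthToken;
    protected HttpClient client;

    GithubProviderSketch(String endpoint, String oauthToken) {
        // Validate the configuration before any connection is attempted.
        if (endpoint == null) {
            throw new IllegalArgumentException("endpoint is required");
        }
        if (oauthToken == null || oauthToken.isBlank()) {
            throw new IllegalArgumentException("oauthToken is required");
        }
        this.endpoint = URI.create(endpoint);
        if (!this.endpoint.isAbsolute()) {
            throw new IllegalArgumentException("endpoint must be an absolute URI");
        }
        this.oauthToken = oauthToken;
    }

    /** Creates the re-usable client; no request is sent yet. */
    void prepare() {
        client = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(10)).build();
    }

    boolean isPrepared() {
        return client != null;
    }

    /** Drops the reference to the client so it can be garbage collected. */
    void cleanUp() {
        client = null;
    }

    public static void main(String[] args) {
        GithubProviderSketch provider = new GithubProviderSketch("https://api.github.com", "<personal-token>");
        provider.prepare();
        System.out.println("prepared: " + provider.isPrepared());
        provider.cleanUp();
    }
}
```

This base class connects but does not read any data, which leaves the read methods to the specialized providers built on top of it.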

Create an integration test that demonstrates connectivity

Create an 'IT' (integration test) class in src/test/java that loads the test configuration with your credentials in it, instantiates a provider, and then asserts that the connection object is connected and authorized.

Example:

TODO
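The connectivity check at the heart of such a test can be sketched with the JDK HTTP client alone. The example builds an authenticated request against the GitHub v3 endpoint for the current user and treats a 200 response as authorized. The class name, the helper methods, and the GITHUB_TOKEN environment variable are assumptions; a real integration test would go through the provider and a test framework rather than a main method.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Sketch only: checks that a personal token is accepted by the GitHub REST API. */
class GithubConnectivityCheck {

    /** Builds an authenticated request for the user that owns the token. */
    static HttpRequest buildRequest(String token) {
        return HttpRequest.newBuilder(URI.create("https://api.github.com/user"))
            .header("Authorization", "token " + token)
            .header("Accept", "application/vnd.github.v3+json")
            .GET()
            .build();
    }

    /** A 200 status means the token was accepted; 401 indicates bad credentials. */
    static boolean isAuthorized(int statusCode) {
        return statusCode == 200;
    }

    public static void main(String[] args) throws Exception {
        // Assumed environment variable name; the check is skipped when it is not set.
        String token = System.getenv("GITHUB_TOKEN");
        if (token == null || token.isBlank()) {
            System.out.println("GITHUB_TOKEN not set; skipping connectivity check");
            return;
        }
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(buildRequest(token), HttpResponse.BodyHandlers.ofString());
        System.out.println("authorized: " + isAuthorized(response.statusCode()));
    }
}
```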

Create a specialized provider configuration for the profile provider

Create a second configuration bean that extends the base configuration bean we created earlier, but also has fields that specify what data should be collected.

...

Create initial provider class for collecting profiles

Create a provider class that extends the base provider (which connects but doesn't implement read methods).

...

https://git-wip-us.apache.org/repos/asf?p=incubator-streams.git;a=blob;f=streams-contrib/streams-provider-twitter/src/main/java/org/apache/streams/twitter/provider/TwitterUserInformationProvider.java;h=214d2049edd307a059a7c8800faeb6868d866d60;hb=HEAD

Identify and understand the upstream java library profile object

If using a java library, find the object in the SDK that corresponds to the profile object.

Figure out if

Add a main method to the provider

This should allow you to run the provider from the command line, with all the collected data written into a specified file.
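A sketch of what such a main method could look like. It assumes two command-line arguments, a configuration file and an output file, and writes one document per line. The collect method is a hard-coded stand-in for the provider's read loop, and the class name and argument order are assumptions rather than the project's convention.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Sketch only: command-line entry point that writes collected documents to a file. */
class GithubProviderMain {

    /** Stand-in for the provider's read loop; a real provider would query the data source. */
    static List<String> collect() {
        return Arrays.asList(
            "{\"login\":\"example-user-1\"}",
            "{\"login\":\"example-user-2\"}");
    }

    /** Writes each collected document as one line of the output file and returns the count. */
    static int run(Path outfile) {
        List<String> documents = collect();
        try {
            Files.write(outfile, documents, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return documents.size();
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: GithubProviderMain <configfile> <outfile>");
            System.exit(1);
        }
        // args[0] would name the configuration file; it is not read in this sketch.
        int written = run(Paths.get(args[1]));
        System.out.println("wrote " + written + " documents to " + args[1]);
    }
}
```

Keeping the work in a run method that returns the document count gives an integration test something to assert on after invoking the entry point.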

Create an integration test for the provider that calls the main method

This simultaneously tests that data can be collected and that the Java CLI binding works.

...