Concepts
A Provider Module is a module that collects a standards-compliant dataset from a non-standards-compliant data source.
A Provider Module contains provider document schemas, provider configuration schemas, provider classes, converter classes, and testing that ensures provider and converter classes continue to work as expected over time.
A Provider Document Schema is an artifact that declares the significant fields of a particular entity, event, or relationship in its original form.
A Provider Configuration Schema is an artifact that describes how to query a particular endpoint for data of a particular type.
A Provider Class is software that can produce documents matching a Provider Document Schema, given an instance of its Provider Configuration Schema.
A Provider Converter is software that knows how to convert provider documents into Activity Streams documents.
Goal
The goal is to write a Maven module that supports collection and conversion of a new online data source into Activity Streams format using Apache Streams interfaces and patterns.
Official Providers reside in the streams-contrib directory of streams-project, are named org.apache.streams:streams-provider-*, and are released with the rest of streams.
Unofficial Providers can be built and published under any groupId, but the artifactId should still be named streams-provider-* and the code should be built with Apache Streams interfaces and patterns.
https://github.com/apache/incubator-streams
Steps
Pick a source
Decide where the data will come from.
For example purposes, this guide walks through the process of building a provider for github.com.
Collect links to pertinent documentation and resources.
Find their API documentation and confirm that the entities, events, and relationships available align with the Activity Streams data model.
REST API documentation: https://developer.github.com/v3/
Figure out credentials
If the data source requires credentials, go through the process of getting credentials and document how you got them.
Create personal token: https://github.com/settings/tokens
Identify Important Entities
Go through the API documentation carefully, and make a list of the most important types of Entities, as well as at least one way those entities can be collected.
Entities are typically people, places, or things, and sometimes abstract things such as ideas or concepts.
The API will typically allow the caller to get a list of entities of a specific type.
Top Three GitHub Entity Types:
- Organization
- Repository
- User
Identify Important Event Types
Go through the API documentation carefully, and make a list of the most important types of Events that involve the important Entity types, as well as at least one way those events can be collected.
Events are specific occurrences involving one or more entities, almost always with a timestamp.
List of 30 types - https://developer.github.com/v3/activity/events/types/
- https://developer.github.com/v3/activity/events/#list-events-for-an-organization
- https://developer.github.com/v3/activity/events/#list-events-performed-by-a-user
- https://developer.github.com/v3/activity/events/#list-public-events-performed-by-a-user
Top Four Event Types:
- MemberEvent - https://developer.github.com/v3/activity/events/types/#memberevent
- OrganizationEvent - https://developer.github.com/v3/activity/events/types/#organizationevent
- RepositoryEvent - https://developer.github.com/v3/activity/events/types/#repositoryevent
- WatchEvent - https://developer.github.com/v3/activity/events/types/#watchevent
Identify Important Relationships
Go through the API documentation carefully, and make a list of the most important types of Relationships that associate the important Entity types, as well as at least one way those relationships can be collected.
Top Four Relationship Types:
- Users are Members of Organizations
- Repositories belong to Organizations
- Users contribute to repositories
- Users follow other users
Document up-stream data model
For each entity, event, and relationship on the short-list, find links to documentation on exactly what the data looks like in JSON or XML form.
Find the best Activity Streams Actor or Object type for each upstream Entity type
Identify which upstream Entity types have a direct and obvious alignment with Activity Streams, and note the corresponding Actor or Object type from the Activity Streams vocabulary.
github:organization -> as:Organization
Some of these will be obvious, some will be debatable, and some simply won't match the Activity Streams vocabulary at all. That's OK.
Actor / Object Types produced by official provider modules include:
- Person / Profile / Page
- Organization / Group
Find the best Activity Streams Activity or Relationship type for each upstream Event type
Identify which upstream Event types have a direct and obvious alignment with Activity Streams, and note the corresponding Activity and/or Relationship type from the Activity Streams vocabulary.
Activity Types produced by official provider modules include:
- Post
- Share
Find the best Activity Streams Activity or Relationship type for each upstream Relationship type
Identify which upstream Relationship types have a direct and obvious alignment with Activity Streams, and note the corresponding Activity and/or Relationship type from the Activity Streams vocabulary.
github:follow -> as:IsFollowing, as:IsFollowedBy
github:member -> as:IsMember
Relationship Types produced by official provider modules include:
- Follow / Friend
Enumerate and Prioritize Providers
Make a short list of providers to write in the initial implementation, describing the use case they support, identifying what inputs they will require to start and what type(s) of documents they will provide.
- OrganizationProvider
  - Use Case: Given a finite list of organization IDs, pull latest details about each.
    - Input: List of organization ids
    - Output: github:organization
  - Use Case: Given a finite list of user IDs, pull latest details on each organization any of them belong to.
    - Input: List of user ids
    - Output: github:organization
- RepositoryProvider
  - Use Case: Given a finite list of organization IDs, pull latest details on each public repository belonging to any of them.
    - Input: List of organization ids
    - Output: github:repository
  - Use Case: Given a finite list of user IDs, pull latest details on each public repository belonging to any of them.
    - Input: List of user ids
    - Output: github:repository
- UserProvider
  - Use Case: Given a finite list of user IDs, pull latest details on all of them.
    - Input: List of user ids
    - Output: github:user
- UserFollowingProvider
  - Use Case: Given a finite list of user IDs, pull latest details on each user any of them are following, maintaining the follow connection.
    - Input: List of user ids
    - Output: github:follow (github:user, github:user)
Make an empty module
Create an empty module in your own project or in streams-contrib. Make sure it's part of the reactor.
Create a base configuration object
Create a JSON schema file (in src/main/jsonschema) with fields containing all the information needed to establish basic connectivity with the data source.
These fields should include:
- everything needed to connect
- everything needed to authenticate
Example:
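A minimal sketch of what such a schema might look like for the GitHub example, assuming the module generates its beans with jsonschema2pojo. The file name (GitHubConfiguration.json), the javaType, and every property name are assumptions made for illustration and are not taken from an existing module.

```json
{
  "$schema": "http://json-schema.org/draft-03/schema",
  "id": "#",
  "type": "object",
  "description": "Hypothetical base configuration for a GitHub provider module",
  "javaType": "org.apache.streams.github.GitHubConfiguration",
  "javaInterfaces": ["java.io.Serializable"],
  "properties": {
    "protocol": {
      "type": "string",
      "description": "Connection detail (illustrative): http or https"
    },
    "host": {
      "type": "string",
      "description": "Connection detail (illustrative): API host name"
    },
    "port": {
      "type": "integer",
      "description": "Connection detail (illustrative): API port"
    },
    "accessToken": {
      "type": "string",
      "description": "Credential (illustrative): personal access token"
    }
  }
}
```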
Create a reference.conf
Create a reference.conf file in src/main/resources with a HOCON snippet that matches the base configuration schema and holds just the connection details.
This file should contain only the connection details, no credentials.
By putting these in reference.conf, you ensure that they get set by default for anyone who uses the module, thus relieving you of the need to bake default values into either the code or the JSON schemas.
Example:
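A sketch of such a file, assuming the configuration lives under a top-level key named github with fields called protocol, host, and port. Those names are hypothetical and should match whatever the base configuration schema actually declares.

```hocon
# Hypothetical reference.conf: connection details only, no credentials.
# The "github" key and the field names are assumptions for illustration.
github {
  protocol = "https"
  host = "api.github.com"
  port = 443
}
```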
Create a credential resource file for testing
Create an application.conf file with a HOCON snippet that matches the base configuration schema and contains your credentials.
This file should contain only your credentials - but you only need one credential file for every provider you are working with.
Example:
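A sketch of a credentials-only file, assuming the same hypothetical github key and a hypothetical accessToken field. The value shown is a placeholder, not a real token, and a file like this should normally be kept out of version control.

```hocon
# Hypothetical application.conf used only for local testing.
# "github" and "accessToken" are assumed names; replace the placeholder value.
github {
  accessToken = "REPLACE_WITH_YOUR_PERSONAL_TOKEN"
}
```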
Create a unit test that demonstrates reading the test configuration resource into the configuration object
The test should demonstrate that the test resource is loaded from the HOCON snippet into the JVM properties and then, using StreamsConfigurator, into an instance of the base configuration object.
Example:
TODO
Write a primary class to manage the HTTP connections and implement accessor methods.
Give it a singleton getInstance method driven from the configuration object.
Example:
org.apache.streams.twitter.api.Twitter
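The idea of a getInstance method driven by the configuration object can be sketched with the standard library alone. GitHubConfig and GitHubClient are hypothetical names, not classes from Apache Streams or from a GitHub SDK. Keying shared instances by configuration equality is one possible design and is not a description of how the Twitter class above is written.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the bean generated from the base configuration schema. */
final class GitHubConfig {
  final String host;
  final String accessToken;

  GitHubConfig(String host, String accessToken) {
    this.host = Objects.requireNonNull(host);
    this.accessToken = Objects.requireNonNull(accessToken);
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof GitHubConfig)) {
      return false;
    }
    GitHubConfig other = (GitHubConfig) o;
    return host.equals(other.host) && accessToken.equals(other.accessToken);
  }

  @Override
  public int hashCode() {
    return Objects.hash(host, accessToken);
  }
}

/** Hypothetical primary class: one shared instance per distinct configuration. */
final class GitHubClient {
  private static final Map<GitHubConfig, GitHubClient> INSTANCES = new ConcurrentHashMap<>();

  private final GitHubConfig config;

  private GitHubClient(GitHubConfig config) {
    this.config = config;
  }

  /** Returns the existing client for an equal configuration, or creates one. */
  static GitHubClient getInstance(GitHubConfig config) {
    return INSTANCES.computeIfAbsent(config, GitHubClient::new);
  }

  /** Base URL that the accessor methods would build their requests on. */
  String baseUrl() {
    return "https://" + config.host + "/";
  }
}
```

Two calls with equal configurations return the same object, so HTTP connections and rate-limit bookkeeping could be shared; a different token yields a separate instance.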
Create an integration test that demonstrates basic connectivity
Make an 'IT' in src/test/java that loads the test configuration with your credentials in it, instantiates the primary connection class, and asserts that the connection object is instantiated and authorized.
Example:
Authentication
Find request signing documentation
Most APIs require requests to be cryptographically signed. The exact details and protocols may differ.
Figure out permissions
If the data source requires special permissions to get at the dataset you are looking at, figure out how to get those permissions and document the process.
Document all the information that will be needed to connect to the data source.
Integration
Our goal is to create interfaces that let us access important entities, events, and relationships from the data provider in their native format, via Java objects generated from schemas.
Create at least one Java interface to wrap the data provider
REST interfaces typically have a tree structure:
- https://api.github.com/
  - orgs/
  - repos/
  - users/
A call to the interface will typically contain:
- a path, which might contain path parameters,
- a set of query parameters
- and/or a request entity
Typically, for each path we want to call, we will create a Java method on one of several interfaces, enumerate the request parameters, describe the request as a Java bean, and describe the response as a Java bean.
Example:
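A minimal sketch of the one-method-per-path idea, using the GitHub endpoint GET /orgs/{org}/repos with its per_page and page query parameters. The names Organizations, OrgReposRequest, and OrganizationsUris are hypothetical. So that the sketch runs without network access, the method returns the request URI it would call instead of a response bean.

```java
import java.net.URI;

/** Hypothetical request bean for GET /orgs/{org}/repos. */
final class OrgReposRequest {
  final String org;   // path parameter
  final int perPage;  // query parameter "per_page"
  final int page;     // query parameter "page"

  OrgReposRequest(String org, int perPage, int page) {
    // This sketch only accepts letters, digits, and hyphens, so no URL encoding is needed.
    if (org == null || !org.matches("[A-Za-z0-9-]+")) {
      throw new IllegalArgumentException("unsupported organization name: " + org);
    }
    if (perPage < 1 || page < 1) {
      throw new IllegalArgumentException("perPage and page must be positive");
    }
    this.org = org;
    this.perPage = perPage;
    this.page = page;
  }
}

/** Hypothetical interface: one method per path under orgs/. */
interface Organizations {
  URI listRepos(OrgReposRequest request);
}

/** Builds request URIs only; a real implementation would also send the request. */
final class OrganizationsUris implements Organizations {
  private final String baseUrl;

  OrganizationsUris(String baseUrl) {
    this.baseUrl = baseUrl;
  }

  @Override
  public URI listRepos(OrgReposRequest request) {
    return URI.create(baseUrl + "orgs/" + request.org + "/repos"
        + "?per_page=" + request.perPage + "&page=" + request.page);
  }
}
```

Keeping path and query parameters together in a request bean means the interface method signature stays stable when optional parameters are added later.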
Create an integration test that demonstrates connectivity
Make an 'IT' in src/test/java that loads the test configuration with your credentials in it, instantiates the primary class, and then tests that the connection object is connected and authorized.
Example:
TODO
Create a specialized provider configuration for the profile provider
Create a second configuration bean that extends the base configuration bean we created earlier, but also has fields that specify what data should be collected.
Example:
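A sketch of such a schema, assuming beans are generated with jsonschema2pojo and that the base schema lives in a file named GitHubConfiguration.json. The file name, the javaType, and the info field are hypothetical names chosen for illustration.

```json
{
  "$schema": "http://json-schema.org/draft-03/schema",
  "id": "#",
  "type": "object",
  "description": "Hypothetical configuration for a profile provider",
  "javaType": "org.apache.streams.github.GitHubUserInformationConfiguration",
  "extends": { "$ref": "GitHubConfiguration.json" },
  "javaInterfaces": ["java.io.Serializable"],
  "properties": {
    "info": {
      "type": "array",
      "description": "Illustrative field: the user IDs whose profiles should be collected",
      "items": { "type": "string" }
    }
  }
}
```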
Create initial provider class for collecting profiles
Create a provider class that extends the base provider (which connects but doesn't implement read methods).
The first provider will take a list of IDs and get the current profile for each.
Implement the startStream method on this provider. startStream should create and queue threads to bring data into the class.
Implement the readCurrent method on this provider. readCurrent should pass collected data in a StreamsResultSet to the caller exactly once.
Example:
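The division of labor between startStream and readCurrent can be sketched with the standard library alone. ProfileCollector is a hypothetical class that mirrors the two method names but does not implement the real Apache Streams provider interface, and it returns a plain list where a real provider would return a StreamsResultSet. The fetchProfile function stands in for the call to the data source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical stand-in for a profile provider; not an Apache Streams class. */
final class ProfileCollector {
  private final List<String> ids;
  private final Function<String, String> fetchProfile; // assumed: id -> profile document
  private final LinkedBlockingQueue<String> collected = new LinkedBlockingQueue<>();
  private final ExecutorService executor = Executors.newFixedThreadPool(2);

  ProfileCollector(List<String> ids, Function<String, String> fetchProfile) {
    this.ids = new ArrayList<>(ids);
    this.fetchProfile = fetchProfile;
  }

  /** Queues one task per ID; each task adds its result to the shared queue. */
  void startStream() {
    for (String id : ids) {
      executor.execute(() -> collected.add(fetchProfile.apply(id)));
    }
    executor.shutdown(); // accept no further tasks; already queued tasks still run
  }

  /** Hands over everything collected so far; drained items are never returned again. */
  List<String> readCurrent() {
    List<String> batch = new ArrayList<>();
    collected.drainTo(batch);
    return batch;
  }

  /** True until every queued task has finished. */
  boolean isRunning() {
    return !executor.isTerminated();
  }

  /** Blocks until all queued tasks finish or the timeout elapses. */
  boolean awaitCompletion(long timeoutMillis) {
    try {
      return executor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

A caller would typically invoke startStream once and then poll readCurrent while isRunning is true, with one final readCurrent afterwards to pick up anything collected last.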
Identify and understand the upstream java library profile object
If using a Java library, find the object in the SDK that corresponds to the profile object.
Figure out if
Add a main method to the provider
This should allow you to run the provider from the command line, with all the collected data written into a specified file.
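A minimal sketch of such a command-line entry point, writing one document per line. ProviderCli and its collect method are hypothetical placeholders. A real provider would load its configuration, start the stream, and write whatever readCurrent returns, and its argument list may differ from the single optional outfile assumed here.

```java
import java.io.IOException;
import java.io.PrintStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical command-line wrapper; not taken from an existing provider. */
final class ProviderCli {

  /** Placeholder for the provider's collected documents, already serialized as JSON. */
  static List<String> collect() {
    return Arrays.asList("{\"login\":\"octocat\"}", "{\"login\":\"apache\"}");
  }

  /** Writes one document per line and returns the number of documents written. */
  static int writeAll(Iterable<String> documents, PrintStream out) {
    int count = 0;
    for (String document : documents) {
      out.println(document);
      count++;
    }
    out.flush();
    return count;
  }

  /** Usage (assumed): ProviderCli [outfile]. Without an outfile, documents go to standard output. */
  public static void main(String[] args) throws IOException {
    if (args.length < 1) {
      writeAll(collect(), System.out);
      return;
    }
    try (PrintStream out =
        new PrintStream(Files.newOutputStream(Paths.get(args[0])), false, "UTF-8")) {
      writeAll(collect(), out);
    }
  }
}
```

Writing one document per line keeps the output easy to reuse as test input for converters.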
Create an integration test for the provider that calls the main method
This simultaneously tests that data can be collected, and that the java CLI binding works.
As a side effect, collected data gets placed in target/test-classes and can then be used to test conversion.
Example:
Implement a