PR

TBD

Introduction

During the past few years, some issues have been observed with the usage server, most of them related to limitations and bugs in its implementation. For example, the usage server stops working and cannot be restarted if duplicate entries are detected for events, requiring manual intervention.

It has also been observed that the performance of the current usage server implementation degrades as the database grows in size, especially with shorter aggregation intervals such as hourly. When this happens, the execution time of the usage jobs exceeds the aggregation interval and a delay is observed in the dates of the generated usage records.

Purpose

A major refactor of the usage server is proposed to improve its performance.

Document History

Version | Author/Reviewer | Date
1.0     | Nicolas Vazquez |

Feature Specifications

This feature proposes a refactor of the usage server, ensuring that the following requirements are met for administrators:

  • Run the usage server in a multi-threaded mode (the usage server currently takes a long time, as it appears to be single-threaded). The maximum number of threads is controlled by a new usage server property/configuration exposed to the administrator. Explore/PoC whether multiple nodes can be active at the same time.
  • Usage-server-related APIs must work with a specified domainid/accountid and specific usage types (example: regenerating data for a certain usage type over a time interval).
  • In case of duplicates where healing is possible, the usage job must continue running and fix the issue. In other cases, an email or alert must be sent to the administrator for manual repair.
  • Create an API to re-generate records between a start and end date (delete records for a slice of time/dates and insert new regenerated usage data records). Assumption: the accounts, events, and helper tables needed to regenerate usage data are available.
  • Add a global setting for the number of years of usage data to keep (minimum 1 year as the default for fresh installations, and default value = -1 upon upgrade for backward compatibility), removing older usage data.
  • Refactor and extend the sanity check task to ensure all the usage types are covered, report any errors in separate sanity check logs, and send an email to the administrator.

Architecture and Design description

API Changes

The administrators must be able to re-generate usage data for a certain date range for all the accounts or specific account(s) of a domain:

  • A new API (regenerateUsageRecords) must be created to re-generate usage data for a start and end date interval.
  • The API must accept an optional parameter for an account or a domain ID. If this parameter is not set, then all the accounts are processed. If the parameter is set, then only the specified account, or the accounts of the specified domain, are processed (see the sketch after this list).
  • The usage data generated for the selected date range is removed (if any) and replaced with the usage data re-generated by the API execution:
    • If the account ID (or domain ID) parameter is set, then only the usage records for the matching account(s) are removed from the database and the re-generated values are inserted
    • If the account or domain ID is not set, then all the accounts’ usage data will be removed and re-generated

  • The API execution will create and persist a new job with a type = ‘REGENERATE’ to differentiate from other usage jobs.

  • The existing generateUsageRecords API must be fixed to allow re-generation if data in cloud_usage.cloud_usage is missing for the specified start/end date but the job metadata indicates it should be available. The existing API does not regenerate the usage records properly: it only works if there is no usage data generated after the initial time set on the API.
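
A minimal sketch of the inputs the proposed regenerateUsageRecords API could take and how the target accounts could be resolved is shown below. The class, field, and helper names are assumptions for illustration only, not the final CloudStack command definition; in particular, treating the account and domain parameters as mutually exclusive is an assumption.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Optional;

/**
 * Plain-Java sketch of the request the proposed regenerateUsageRecords API
 * could accept. Names and types are illustrative assumptions, not the final
 * CloudStack command definition.
 */
public class RegenerateUsageRecordsRequest {

    private final LocalDate startDate;      // start of the interval to regenerate
    private final LocalDate endDate;        // end of the interval to regenerate
    private final Optional<Long> accountId; // optional: restrict to a single account
    private final Optional<Long> domainId;  // optional: restrict to all accounts of a domain

    public RegenerateUsageRecordsRequest(LocalDate startDate, LocalDate endDate,
                                         Optional<Long> accountId, Optional<Long> domainId) {
        if (endDate.isBefore(startDate)) {
            throw new IllegalArgumentException("endDate must not be before startDate");
        }
        if (accountId.isPresent() && domainId.isPresent()) {
            throw new IllegalArgumentException("accountId and domainId are mutually exclusive");
        }
        this.startDate = startDate;
        this.endDate = endDate;
        this.accountId = accountId;
        this.domainId = domainId;
    }

    /** Resolves the accounts whose usage data would be removed and regenerated. */
    public List<Long> resolveTargetAccounts(AccountLookup lookup) {
        if (accountId.isPresent()) {
            return List.of(accountId.get());          // single account
        }
        if (domainId.isPresent()) {
            return lookup.listAccountIdsInDomain(domainId.get()); // all accounts of the domain
        }
        return lookup.listAllAccountIds();            // no filter: regenerate for every account
    }

    /** Minimal lookup abstraction so the sketch stays self-contained. */
    public interface AccountLookup {
        List<Long> listAccountIdsInDomain(long domainId);
        List<Long> listAllAccountIds();
    }
}
```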

Service Layer, Schema and Other Changes

The usage job’s last step is the parsing of the helper tables into ‘cloud_usage’ records. The parsing is currently performed sequentially, one account after the other, and can be parallelized to increase its performance through better CPU utilization. The proposed approach parallelises across accounts: each thread processes one account at a time, while the records of a single account are still processed sequentially. The processing of the events remains sequential and is performed prior to the parallelised processing of the accounts.

  • The helper tables parsing will become a parallel task, divided across worker threads based on the number of CPUs available to the usage server instance. It is also proposed to expose a setting in the usage server’s configuration file so that administrators can tune the number of threads. As there is database and possibly network I/O involved, more threads than available CPUs can be used to optimise CPU usage.
  • The parallel tasks will receive a portion of the total number of accounts and will process them in parallel, using a synchronized queue of account jobs (see the sketch after this list)
  • The parallel tasks must synchronize their writes into the ‘cloud_usage’ table, making sure only one task at a time is able to write. As the writes to the database will be parallelised, the order of the generated usage records will no longer be kept on a per-account basis as it is today. It is therefore proposed to extend the list API for usage records to order the retrieved usage records on each query.
  • In case of multiple usage server instances, the tasks could run in parallel and concurrently across the different usage server instances. This is to be explored; it is proposed to evaluate the benefits of multi-threading vs. multi-process execution.
  • The heartbeat task can be removed in favour of parallel coordination between tasks.
  • In case of any failure, such as violating MySQL constraints by attempting to insert duplicated records, the usage job must fail and log the issues found accordingly. All the usage data generated up to the moment of the failure must be persisted in the database, and the administrator must be notified via alert or email about the issues with the usage jobs. The next daily job must not re-generate the same data again.
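
A minimal sketch of the parallel helper-table parsing described above, assuming a fixed-size thread pool, a shared queue of account jobs, and a lock that serialises the writes into ‘cloud_usage’. The thread-count source, helper interfaces, and record shape are assumptions for illustration.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Sketch of the proposed parallel helper-table parsing. The helper interfaces
 * and the origin of the thread count are assumptions for illustration.
 */
public class ParallelUsageParser {

    private final int threads;                                   // e.g. taken from the proposed usage server setting
    private final ReentrantLock writeLock = new ReentrantLock(); // one writer at a time into cloud_usage

    public ParallelUsageParser(int threads) {
        this.threads = threads;
    }

    /** Parses each account's helper records in parallel; each account is handled sequentially by one worker. */
    public void parseAccounts(List<Long> accountIds, AccountParser parser) throws InterruptedException {
        Queue<Long> queue = new ConcurrentLinkedQueue<>(accountIds); // synchronized queue of account jobs
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                Long accountId;
                while ((accountId = queue.poll()) != null) {
                    // CPU/DB bound work for a single account runs concurrently across workers
                    List<UsageRecord> records = parser.parseHelperTables(accountId);
                    writeLock.lock();                 // serialize the insert into cloud_usage
                    try {
                        parser.persist(records);
                    } finally {
                        writeLock.unlock();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    /** Minimal abstractions so the sketch stays self-contained. */
    public interface AccountParser {
        List<UsageRecord> parseHelperTables(long accountId);
        void persist(List<UsageRecord> records);
    }

    public record UsageRecord(long accountId, String usageType, double rawUsage) { }
}
```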

The usage job must be more robust in its execution. One known issue with the current usage server implementation comes from duplicate event records causing the usage job to fail and never succeed until the duplicates are removed manually. To increase robustness, one additional step can be added after the helper records creation and before their parsing. The additional step must perform the following actions:

  • Ensure that the records in the events and helper tables have no duplicates. The usage event parsers will be improved so that duplicate events are checked and ignored where the case allows.
  • If there are duplicates, remove them and log the action with a suitably detailed message

If, after the additional step, there are still failures on the usage jobs, then the job must fail detailing the reason, log the errors and send an alert to the administrator to take action.
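
As an illustration of the duplicate check, the sketch below groups events by a logical key and keeps only the first occurrence, logging what is ignored. The event shape and the key fields are assumptions; the real check would run against the events and helper tables in the cloud_usage database.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.logging.Logger;

/**
 * Sketch of the duplicate check that could run after the helper records are
 * created and before they are parsed. The event shape and key fields are
 * illustrative assumptions.
 */
public class DuplicateEventCheck {

    private static final Logger LOG = Logger.getLogger(DuplicateEventCheck.class.getName());

    /** Simplified usage event; the real records live in the events/helper tables. */
    public record UsageEvent(String type, long accountId, long resourceId, long createdMillis) { }

    /** Returns the events with duplicates removed, logging each removal. */
    public List<UsageEvent> deduplicate(List<UsageEvent> events) {
        Set<String> seen = new HashSet<>();
        List<UsageEvent> unique = new ArrayList<>();
        for (UsageEvent event : events) {
            // Two events with the same type, account, resource and timestamp are treated as duplicates (assumed key).
            String key = event.type() + ":" + event.accountId() + ":" + event.resourceId() + ":" + event.createdMillis();
            if (seen.add(key)) {
                unique.add(event);
            } else {
                LOG.warning("Ignoring duplicate usage event " + key + "; keeping the first occurrence");
            }
        }
        return unique;
    }
}
```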


A new global setting will be created to control the number of years of old usage data that is kept in the ‘cloud_usage’ table of the ‘cloud_usage’ database.

  • The global setting will allow the usage server to automatically remove all the existing usage data with a date prior to the current date minus the configured number of years.
  • The default value of the global setting will be 0 for existing environments, meaning that no clean-up will be performed by the usage server on upgrades.
  • The minimum valid value will be 1, which will be the default for fresh installations.
  • A new task will be created in the usage server (it can run on a daily interval) to execute when no usage job is running and remove the usage data with a date older than [current date - <YEARS>]. After the task is completed, the only usage data kept in the database will have dates between [current date - <YEARS>] and [current date] (see the sketch below)
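
A minimal sketch of such a removal task, assuming the retention value comes from the new global setting and that simple hooks exist to check for running usage jobs and to delete old rows; the class, interface, and method names are assumptions for illustration.

```java
import java.time.LocalDate;

/**
 * Sketch of the daily task that would prune cloud_usage records older than the
 * configured number of years. Setting and method names are illustrative.
 */
public class UsageDataRetentionTask implements Runnable {

    private final int retentionYears;      // value of the proposed global setting
    private final UsageStore store;        // minimal abstraction over the cloud_usage table
    private final UsageJobMonitor monitor; // tells us whether a usage job is currently running

    public UsageDataRetentionTask(int retentionYears, UsageStore store, UsageJobMonitor monitor) {
        this.retentionYears = retentionYears;
        this.store = store;
        this.monitor = monitor;
    }

    @Override
    public void run() {
        if (retentionYears <= 0) {
            return; // non-positive value disables the clean-up, e.g. on upgraded environments
        }
        if (monitor.isUsageJobRunning()) {
            return; // only prune when no usage job is running; the next daily run will retry
        }
        LocalDate cutoff = LocalDate.now().minusYears(retentionYears);
        long removed = store.deleteUsageRecordsOlderThan(cutoff);
        System.out.println("Removed " + removed + " usage records older than " + cutoff);
    }

    /** Minimal abstractions so the sketch stays self-contained. */
    public interface UsageStore {
        long deleteUsageRecordsOlderThan(LocalDate cutoff);
    }

    public interface UsageJobMonitor {
        boolean isUsageJobRunning();
    }
}
```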


The sanity check implementation must be refactored in favour of robustness and completeness of the checks:

  • The sanity check execution will be enabled by default. Currently it is controlled by the global setting ‘usage.sanity.check.interval’, meaning that its default value must be set accordingly.
  • The sanity check task will have its own logs for easier checks by the administrator
  • The sanity check will ensure all the usage types are covered
  • The sanity check will verify that the number of usage records for a certain resource and usage type matches the expected number. For example, with a daily aggregation range and a single resource, the expected number of usage records for the resource and a certain type is N, where N is the number of days between the start and the end date (see the sketch after this list).
  • In case of any error, the sanity check will log the errors accordingly and send an email or an alert to the administrator
  • The sanity check execution will also include the duplicate events check, in addition to the usage type checks.
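
A minimal sketch of the record-count check for the daily aggregation case. It assumes the resource existed for the whole interval and that the interval is counted with an inclusive start and exclusive end; the class and method names are illustrative.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/**
 * Sketch of the record-count sanity check. Assumes a daily aggregation range
 * and a resource that existed for the whole interval; names are illustrative.
 */
public class UsageRecordCountCheck {

    /** Expected number of daily usage records between two dates (inclusive start, exclusive end - an assumption). */
    public long expectedDailyRecords(LocalDate startDate, LocalDate endDate) {
        return ChronoUnit.DAYS.between(startDate, endDate);
    }

    /** Returns null when the check passes, otherwise a message suitable for the sanity check log or alert. */
    public String check(long resourceId, String usageType, long actualRecords,
                        LocalDate startDate, LocalDate endDate) {
        long expected = expectedDailyRecords(startDate, endDate);
        if (actualRecords == expected) {
            return null;
        }
        return String.format("Resource %d, usage type %s: expected %d usage records between %s and %s but found %d",
                resourceId, usageType, expected, startDate, endDate, actualRecords);
    }
}
```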

UI Changes

N/A

Marvin Tests

The following test must be included as a separate Marvin class (e.g. have its own tag) to prevent it from running automatically on production environments, as it changes configurations in the system, unless it is explicitly executed by an administrator. However, the test must be included in the Trillian automation.

  • Create a few resources, such as running VMs, on a certain account
  • Set the global configurations for the aggregation range to a low value (around 5 minutes or less) and the usage job start time to a time close to the execution time, so that a usage job can start almost immediately after the usage server is started
  • Start the usage server
  • Wait for 2 or 3 times the aggregation range (in minutes) so that usage records can be generated
  • Stop the usage server
  • List usage records for the account and ensure records have been generated for the created resources and a certain type, for example running time for VMs
    • Ensure the number of records matches the expected number depending on the time spent generating records
  • Wait for 2 or 3 times the aggregation range (in minutes), then set the aggregation range to daily
  • Start the usage server
  • List the usage records for the account and check the number of records for a certain resource and usage type
  • Regenerate the usage records for the account on that date
  • List the usage records for the account for the same resource and usage type, and compare it to the number obtained above. Ensure the number of usage records is higher than in the previous check.
