
...

As a user, I would like to have an out-of-the-box audio data loader and some popular audio transforms in MXNet that would allow me to:

load audio files (only .wav files are supported currently) and make a Gluon AudioDataset (NDArrays),
apply some popular audio transforms to the audio data (for example scaling, MEL, MFCC),
load the dataset using Gluon's DataLoader and train a neural network (e.g. an MLP) with the transformed audio dataset,
perform a simple audio task such as sound classification, with 1 audio clip mapped to 1 label (a multi-class sound classification problem),
have an end-to-end example for a task (Urban Sounds Classification) including:

reading audio files from a folder location (can be extended to an S3 bucket later) and loading them into the AudioDataset
applying audio transforms
training a model (a neural network) with the AudioDataset or DataLoader
performing the multi-class classification, i.e. running inference

Note: The plan is to add a working version of this design, with an example, to the contrib package of MXNet before it is agreed to move the implementation to the gluon.data module.

...

The AudioFolderDataset has a function to initialize the dataset with the audio filenames and their corresponding labels, in line with the ImageFolderDataset.
The design permits passing a transform function (available in the gluon.contrib.data.audio.transforms package) while creating a dataset; it is called with lazy=False to perform the transform in one shot for all the data.
A Loader transform is a special transform that loads a wav file into an NDArray. Any Compose will have it as the first transform applied to an audio file.
a) Loading the audio into an NDArray is kept as a Gluon transform so that other libraries can be used in the future to read the wav file into the NDArray.
In this way, we only need to modify the Gluon audio transform instead of changing the AudioFolderDataset class.
Any transform defined is of type Block; it extracts features (for example 'mfcc') from its input and returns the transformed NDArray.
a) For now the transform classes do not extend HybridBlock, because librosa is used to load audio and compute features. This requires converting the input NDArray to numpy for librosa's feature extraction and then back to an NDArray for iteration using the DataLoader.
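The design points above can be sketched in plain Python. This is only an illustration under stated assumptions, not the proposed Gluon implementation: the standard-library wave module stands in for librosa, Python lists stand in for NDArrays, and every name (WavLoader, PeakScale, Compose, FolderDataset, write_sine_wav, build_demo) is hypothetical. It shows a folder dataset that only stores (filename, label) pairs, a loader that is the first transform in a composition, and an eager (lazy=False) pass over all the data.

```python
import math
import os
import struct
import tempfile
import wave


class WavLoader:
    """Hypothetical loader transform: wav path -> list of floats in [-1, 1)."""

    def __call__(self, path):
        with wave.open(path, "rb") as wav:
            if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
                raise ValueError("sketch only handles 16-bit mono wav files")
            frames = wav.readframes(wav.getnframes())
        count = len(frames) // 2
        return [s / 32768.0 for s in struct.unpack("<%dh" % count, frames)]


class PeakScale:
    """Hypothetical feature transform: rescale so the largest |sample| is 1."""

    def __call__(self, samples):
        peak = max((abs(s) for s in samples), default=0.0)
        return list(samples) if peak == 0.0 else [s / peak for s in samples]


class Compose:
    """Apply transforms in order; the loader is expected to come first."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, item):
        for transform in self.transforms:
            item = transform(item)
        return item


class FolderDataset:
    """Hypothetical dataset: one sub-folder per label, wav files inside."""

    def __init__(self, root):
        self.synsets = sorted(
            d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
        )
        self.items = []  # (file path, integer label); no audio is read here
        for label, name in enumerate(self.synsets):
            folder = os.path.join(root, name)
            for filename in sorted(os.listdir(folder)):
                if filename.lower().endswith(".wav"):
                    self.items.append((os.path.join(folder, filename), label))

    def transform_first(self, fn, lazy=True):
        # lazy=False transforms every item once, up front; lazy=True
        # defers the work until the items are iterated.
        if lazy:
            return ((fn(path), label) for path, label in self.items)
        return [(fn(path), label) for path, label in self.items]


def write_sine_wav(path, freq_hz, n_samples=80, rate=8000, amplitude=0.5):
    """Write a short 16-bit mono sine tone, used only as demo input."""
    samples = [
        int(32767 * amplitude * math.sin(2 * math.pi * freq_hz * i / rate))
        for i in range(n_samples)
    ]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(struct.pack("<%dh" % n_samples, *samples))


def build_demo():
    """Create two labelled clips in a temp folder and transform them eagerly."""
    root = tempfile.mkdtemp()
    for label_name, freq in (("dog_bark", 440.0), ("siren", 880.0)):
        os.makedirs(os.path.join(root, label_name))
        write_sine_wav(os.path.join(root, label_name, "clip0.wav"), freq)
    dataset = FolderDataset(root)
    transforms = Compose([WavLoader(), PeakScale()])
    return dataset, dataset.transform_first(transforms, lazy=False)
```

Because reading the file is itself a transform, switching to a different audio library in this sketch means replacing WavLoader only; FolderDataset stays untouched, which is the separation the design aims for.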

...

# Define the transforms to use (transforms are passed as instances)

audio_transforms = Compose([gluon.contrib.data.audio.transforms.Loader(), gluon.contrib.data.audio.transforms.MFCC()])

# Define the DataLoader, applying the composed transform to the dataset

audio_train_loader = gluon.data.DataLoader(aud_dataset.transform_first(audio_transforms, lazy=False), batch_size=32, shuffle=True)

...

1) Phase 1 - Implement this design and provide an example that uses the AudioFolderDataset to perform the sound classification task with minimal transforms applied to the audio dataset. - Done

2) Phase 2 - A richer variety of audio transforms, usable with some of the loss functions that are instrumental in dealing with audio data.

3) Phase 3 - Support splitting the audio samples into chunks with context of audio frames, and allow data loading in randomized batches across multiple files read from an S3 bucket.

...


Current Implementation status:

Currently, the same design is used to build an example in MXNet Gluon that performs an audio task: multi-class classification. The PR is https://github.com/apache/incubator-mxnet/pull/13325.

PR Status - Merged.

Future Scope:

This design can be iterated upon to build an API for AudioDataset in the MXNet Gluon contrib package, making a generic audio dataset that can be loaded using Gluon's out-of-the-box DataLoader.
Although librosa is used by internal Amazon customers to load audio and extract features from it, once MXNet's FFT operator is supported on CPU, some transforms (like MEL, MFCC) can be implemented as HybridBlocks, which would improve overall feature extraction performance.
Additional transforms can be added, which could later be made into operators in MXNet.
Examples: mu-law encoding, mu-law decoding and so on.
A DataLoader can be designed, or extended from Gluon's existing DataLoader, to create chunks (collections of audio samples) with some overlap. This is required when splitting long audio files into chunks with some context, and can be applied to ASR (Automatic Speech Recognition) and TTS (Text to Speech) applications.
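Two of the ideas above, mu-law companding and overlapping chunks, can be sketched in plain Python. This is a minimal illustration and not an MXNet operator or a Gluon DataLoader: the function names, the default mu=255, and the choice to keep a shorter final chunk are all assumptions made for the example.

```python
import math


def mu_law_encode(x, mu=255):
    """Compress a sample in [-1, 1] with mu-law companding.

    Uses sign(x) * ln(1 + mu*|x|) / ln(1 + mu); the result is in [-1, 1].
    Inputs outside [-1, 1] are clipped first (an assumption of this sketch).
    """
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_decode(y, mu=255):
    """Invert mu_law_encode: sign(y) * ((1 + mu)**|y| - 1) / mu."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)


def chunk_with_overlap(samples, chunk_size, overlap):
    """Split samples into chunks of chunk_size samples.

    Each chunk starts (chunk_size - overlap) samples after the previous
    one, so consecutive chunks share `overlap` samples of context.
    The final chunk may be shorter than chunk_size.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(samples), step):
        chunks.append(samples[start:start + chunk_size])
        if start + chunk_size >= len(samples):
            break  # this chunk already reaches the end of the clip
    return chunks
```

For instance, chunk_with_overlap(list(range(10)), chunk_size=4, overlap=2) yields four chunks, each sharing its first two samples with the end of the previous chunk. A real loader would also need to decide whether to pad or drop a short final chunk before batching.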

References

Librosa library, used to load audio (wav) files and extract features from them.

...
