
Link to Dev list discussion 

discussion - https://lists.apache.org/thread.html/b339c4dd7068f6ae3c4262b82481e3b53f85fff368884e525703cb16@%3Cdev.mxnet.apache.org%3E

Feature Shepherd 

Shepherd - Anirudh Acharya

Problem

Data IO is often a bottleneck for training and inference workflows with image data. As datasets grow too large to fit in main memory, data loading can degrade the performance of the whole workflow. It is therefore beneficial to store images in the binary RecordIO format, which is much more compact than raw image files, occupies less memory, and is more efficient to load.

The goal of this project is to provide an easy-to-use, intuitive interface for pre-processing image data and creating RecordIO files. Currently, customers have to clone the whole MXNet repository to use a command-line tool that pre-processes image datasets and creates RecordIO files, which is inconvenient. With the proposed change, customers will be able to use this functionality directly from the PyPI package.

...

As a user, I’d like to have an API to convert a dataset of raw images into binary format and pack them as RecordIO files.

Open Questions

...

  1. Why is RecordIO the preferred format for image data in MXNet? Are there alternatives to it, such as Apache Parquet or Avro?

Proposed Approach

Implement a new API in MXNet's Data IO API that accepts an image list file or a NumPy array, converts that data into the RecordIO file format, and stores the resulting file. The conversion will be parallelized, and the user will be able to set the number of threads used. The proposed API will provide the same functionality as the existing CLI tool that customers currently use to create .rec files, but customers will have the convenience of using it directly from the PyPI package.
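
To make the shape of the proposal concrete, here is a minimal, standard-library-only sketch of the flow described above: read an image list file, load the images in parallel with a user-chosen number of threads, and write them into one binary file. It is not the proposed MXNet API. The function names (`read_image_list`, `images_to_records`), the `num_workers` parameter, and the 16-byte header layout are all hypothetical, and the framing is a simplified length-prefixed stand-in, not MXNet's actual RecordIO on-disk layout. The list file is assumed to be tab-separated with one label per line: index, label, relative path.

```python
import struct
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_image_list(lst_path):
    """Yield (index, label, relative_path) from a tab-separated list file."""
    with open(lst_path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue  # skip blank or malformed lines
            yield int(parts[0]), float(parts[1]), parts[-1]


def _load_record(root, item):
    """Read one image file and frame it as header + raw file bytes."""
    index, label, rel_path = item
    payload = (Path(root) / rel_path).read_bytes()
    # Hypothetical header: index (uint64), label (float32), payload length (uint32).
    header = struct.pack("<QfI", index, label, len(payload))
    return header + payload


def images_to_records(lst_path, root, out_path, num_workers=4):
    """Pack every image named in lst_path into one binary file; return the count."""
    items = list(read_image_list(lst_path))
    count = 0
    with ThreadPoolExecutor(max_workers=num_workers) as pool, \
            open(out_path, "wb") as out:
        # map() yields results in input order, so records follow list order
        # even though the file reads run concurrently.
        for record in pool.map(lambda it: _load_record(root, it), items):
            out.write(record)
            count += 1
    return count
```

A call such as `images_to_records("data.lst", "images/", "data.rec", num_workers=8)` would then produce a single packed file. Only the reads are parallelized in this sketch, while a single writer keeps the output order deterministic. A real implementation would also need image decoding, resizing, and re-encoding, which are omitted here.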

...