...

  1. Why is RecordIO the preferred format for image data in MXNet? Are there alternatives to it, such as Apache Parquet or Avro?
  2. What are the options for editing an already created .rec file?
    1. The ideal solution is to rewrite the file. Record files are always read and written as streams of data, so records cannot be added in the middle of a file in place. This is not a drawback of the API, since reading and writing any generic text file has the same limitation. Reading, writing, and editing of individual records can, however, be accomplished with the read_idx() and write_idx() methods of the MXIndexedRecordIO object.

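      import mxnet as mx

      label1 = [2, 3]  # multi-label record
      id1 = 2          # record id, also used as the index key

      header1 = mx.recordio.IRHeader(0, label1, id1, 0)

      # Pack the raw image bytes together with the header
      with open('img.jpg', 'rb') as fin:
          img = fin.read()
      s1 = mx.recordio.pack(header1, img)

      # Write record
      write_record = mx.recordio.MXIndexedRecordIO('img.idx', 'img.rec', 'w')
      write_record.write_idx(id1, s1)
      write_record.close()  # flush the .idx and .rec files before reading

      # Read record
      read_record = mx.recordio.MXIndexedRecordIO('img.idx', 'img.rec', 'r')
      item = read_record.read_idx(id1)
      header, img = mx.recordio.unpack_img(item)  # decodes the image (requires OpenCV)
      print(header.label)
      read_record.close()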
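
The point about rewriting can be illustrated with a small stand-in. The sketch below is an assumption-laden toy: it uses a simple length-prefixed layout and an in-memory offset index, not MXNet's actual RecordIO format, and the names write_records, read_record, and insert_record are hypothetical. It only shows why placing a record in the middle of a stream-oriented file amounts to writing a new file.

```python
import os
import struct


def write_records(path, records):
    """Write (key, payload) pairs sequentially; return {key: byte offset}."""
    index = {}
    with open(path, "wb") as f:
        for key, payload in records:
            index[key] = f.tell()
            f.write(struct.pack("<I", len(payload)))  # 4-byte length prefix
            f.write(payload)
    return index


def read_record(path, index, key):
    """Seek to the offset stored for `key` and read that one record."""
    with open(path, "rb") as f:
        f.seek(index[key])
        (length,) = struct.unpack("<I", f.read(4))
        return f.read(length)


def insert_record(path, index, key, payload, before_key):
    """'Edit' the file by streaming every record into a new file.

    Every offset after the insertion point shifts, so both the data file
    and the index have to be regenerated.
    """
    ordered = sorted(index, key=index.get)  # existing keys in file order
    records = []
    for k in ordered:
        if k == before_key:
            records.append((key, payload))
        records.append((k, read_record(path, index, k)))
    tmp_path = path + ".tmp"
    new_index = write_records(tmp_path, records)
    os.replace(tmp_path, path)  # swap the rewritten file into place
    return new_index
```

Random-access reads stay cheap because the index maps keys to offsets, which is the role the .idx file plays for an indexed record file.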


Proposed Approach

Implement a new API in MXNet's Data IO API that accepts an image list file or a NumPy array, converts that data into the RecordIO file format, and stores the resulting file. The conversion will be parallelized, and users will be able to set the number of threads used. The proposed API will offer the same functionality as an existing CLI tool that customers currently use to create .rec files, with the added convenience of being available from the PyPI package itself.
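
One possible shape for the parallel part is sketched below. Everything in it is an assumption made for illustration: the function names (parse_list_line, convert_list), the num_threads parameter, the write_fn callback, and the tab-separated list layout of index, one or more labels, then a relative path. It is not the proposed API itself. Files are read concurrently on a thread pool, while records are handed to the writer one at a time and in list order, since the output file is written as a stream.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parse_list_line(line):
    """Split one list line into (index, labels, relative_path).

    Assumed layout: index <TAB> label [<TAB> label ...] <TAB> path
    """
    parts = line.rstrip("\n").split("\t")
    return int(parts[0]), [float(x) for x in parts[1:-1]], parts[-1]


def _load(root, entry):
    """Read the raw bytes of one image file (runs on a worker thread)."""
    index, labels, rel_path = entry
    with open(os.path.join(root, rel_path), "rb") as f:
        return index, labels, f.read()


def convert_list(list_path, root, write_fn, num_threads=4):
    """Hypothetical sketch: load images in parallel, write them serially.

    write_fn(index, labels, data) is called from the calling thread,
    in the same order as the entries in the list file.
    Returns the number of records written.
    """
    with open(list_path) as f:
        entries = [parse_list_line(line) for line in f if line.strip()]
    count = 0
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map yields results in input order, so output order is stable
        for index, labels, data in pool.map(lambda e: _load(root, e), entries):
            write_fn(index, labels, data)
            count += 1
    return count
```

With MXNet, write_fn could pack each item with mx.recordio.pack and an mx.recordio.IRHeader, then pass the result to write_idx() on an MXIndexedRecordIO object opened for writing. Keeping the writer on a single thread avoids interleaved writes to the .rec stream.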

...