Link to Dev list discussion 

https://lists.apache.org/thread.html/b339c4dd7068f6ae3c4262b82481e3b53f85fff368884e525703cb16@%3Cdev.mxnet.apache.org%3E

Feature Shepherd 

Anirudh Acharya

Problem

Data I/O is often a bottleneck for training and inference workflows that use image data. As datasets grow too large to fit in main memory, data loading can dominate the runtime of the workflow. It is therefore beneficial to store images in the binary RecordIO format, which is much more compact than raw image files, occupies less memory, and is more efficient to load.

The goal of this project is to provide an easy-to-use, intuitive interface to pre-process image data and create RecordIO files. Currently our customers have to clone the whole MXNet repository to use a command line tool that pre-processes image datasets and creates RecordIO files. This is inconvenient; with the proposed change, customers will be able to use this functionality straight out of the PyPI package.

Use cases

As a user, I’d like to have an API to convert a dataset of raw images into binary format and pack them as RecordIO files.

Open Questions

  1. Why is RecordIO a preferred format for image data in MXNet? Are there alternatives to it, such as Apache Parquet or Avro?
  2. What are the options for editing an already created .rec file?
    1. The ideal solution is to rewrite the file: record files are always read and written as streams of data, so it is not possible to add records in the middle of a file in place. This is not a drawback of the API, as the same limitation applies to reading and writing any generic text file in Python or other programming languages. Reading, writing, and editing of record files can be accomplished with the read_idx() and write_idx() methods of the MXIndexedRecordIO object, as shown in the snippets below.

      '''
      Existing RecordIO files can be read, edited, and rewritten using the two
      code snippets below for writing and reading RecordIO files.
      '''
      import mxnet as mx

      # Write a record
      label1 = [2, 3]
      id1 = 2
      header1 = mx.recordio.IRHeader(0, label1, id1, 0)
      with open('img.jpg', 'rb') as fin:
          img = fin.read()
      s1 = mx.recordio.pack(header1, img)
      write_record = mx.recordio.MXIndexedRecordIO('img.idx', 'img.rec', 'w')
      write_record.write_idx(id1, s1)
      write_record.close()


      # Read a record
      read_record = mx.recordio.MXIndexedRecordIO('img.idx', 'img.rec', 'r')
      item = read_record.read_idx(2)
      header, img = mx.recordio.unpack_img(item)
      print(header.label)
      read_record.close()
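      read_record.close()

      As a sketch of the "rewrite the file" approach described above (reusing the img.idx/img.rec files from the snippets; the edited label values are arbitrary), an existing file can be edited by reading each record, modifying it, and writing the result to a new indexed file:

      import mxnet as mx

      # Rewrite-based editing: copy every record from the existing file into a
      # new .rec/.idx pair, modifying records along the way.
      src = mx.recordio.MXIndexedRecordIO('img.idx', 'img.rec', 'r')
      dst = mx.recordio.MXIndexedRecordIO('img_edited.idx', 'img_edited.rec', 'w')

      for idx in src.keys:
          header, img_bytes = mx.recordio.unpack(src.read_idx(idx))
          # Example edit: replace the label while keeping the image bytes unchanged.
          new_header = mx.recordio.IRHeader(header.flag, [5, 6], header.id, header.id2)
          dst.write_idx(idx, mx.recordio.pack(new_header, img_bytes))

      src.close()
      dst.close()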

Proposed Approach

Implement a new API in MXNet's Data IO module that accepts an image list file or a numpy array, converts the data into the RecordIO file format, and stores the resulting file. The conversion will be parallelized, and users will be given the option to set the number of threads used to perform it. The proposed API will have the same functionality as the existing CLI tool currently used to create .rec files, but customers will have the convenience of using this functionality from the PyPI package itself.

State of the existing tools

Creating recordIO files is accomplished using a command line tool. The tool accepts arguments to determine

  • how the output binary file will be packed: whether or not to split the data, and what ratio to use for the training and validation sets.
  • what image transformations need to be applied to the raw images before they are converted to recordIO format.

Each of these arguments is passed as a parameter to the command line tool, and the resulting .rec file is stored in the local folder. The current C++ tool runs as a single-threaded process, whereas the Python tool supports multiple threads/workers.

Current drawbacks of the existing tool

  1. Customers are forced to clone the repository to use the CLI tool.
  2. In case of missing or corrupted raw image files, the whole process is terminated. The new API will log these failures to the console and continue generating the binary file, up to a user-specified failure threshold.
  3. Lacks support for generating multi-part files by splitting the image list (logically splitting the dataset into separate files and generating these parts selectively or all at once).
  4. Does not accept S3 buckets as data source.

Desired Functionality

The process of pre-processing image data and converting a dataset into the RecordIO file format should be easy and intuitive for the user. Here is the ideal workflow for the user:

from mxnet.gluon.data.vision import transforms

imageTransform = transforms.Compose([transforms.Resize(300),
                                     transforms.CenterCrop(224),
                                     transforms.RandomBrightness(0.1),
                                     transforms.Normalize(0, 1)])

mx.io.im2rec(list_filename, imageTransform, dataset_params, output_path)


The user should be able to stack the desired image transformations into a gluon transforms.Compose object and pass that object to the im2rec API, which will apply those transforms and then pack the transformed images into RecordIO files.


Multi-Reader Single-Writer Design for Creating .rec files

Addition of new APIs

im2rec API Specification

Given a list file in the following format:
integer_image_index \t label \t path_to_raw_image

transform the input raw images according to the stacked transforms specified in gluon.transforms, pack the transformed images into RecordIO format according to the parameters in the dataset_params dictionary (see Appendix), and return the path to the output .rec file. An example list file is shown below.
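
For illustration only (the labels and image paths below are hypothetical), a tab-separated list file with three records would look like:

0    0.0    cats/cat_001.jpg
1    0.0    cats/cat_002.jpg
2    1.0    dogs/dog_001.jpg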


im2rec
def mx.io.im2rec(list_file, transforms, dataset_params, output_path):
    """
    Convert the raw images referenced by an image list file into a binary RecordIO file.

    Input Parameters
    ----------------
    list_file : str
        Path to the list file.
    transforms : gluon.data.vision.transforms.Compose
        Stacked transforms to apply to each image.
    dataset_params : dict
        Packing parameters; described in the Appendix.
    output_path : str
        Path to the output location.

    Returns
    -------
    rec_file_path : str
        Path of the generated .rec file.
    """
    return rec_file_path
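
A sketch of how the proposed API could be called once implemented (mx.io.im2rec does not exist yet; the file names and parameter values below are hypothetical):

import mxnet as mx
from mxnet.gluon.data.vision import transforms

# Hypothetical usage of the proposed API.
imageTransform = transforms.Compose([transforms.Resize(300),
                                     transforms.CenterCrop(224)])

dataset_params = {'num_workers': 4,  # worker threads used for packing
                  'parts': 1}        # number of output parts (see Appendix)

rec_file_path = mx.io.im2rec('train.lst', imageTransform, dataset_params, './data/')
print(rec_file_path)  # e.g. './data/train.rec'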


Backward Compatibility

After the API is implemented, the existing CLI tool will continue to exist, but users will also be directed to the new API and its accompanying documentation/tutorials.

Performance Benchmarks

<>

Alternative Approach

One of the initial approaches considered was to expose each image transform and each dataset parameter as a separate argument to the API. This would result in an API with potentially 10-15 parameters, and adding or removing transforms or parameters later would be difficult and could break the API. Hence, using gluon.transforms was preferred. The sketch below illustrates the difference.
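
A hypothetical illustration of the two signatures (the function names and parameter lists here are illustrative only, not part of the proposal):

# Rejected design: every transform and packing option is its own argument,
# so adding or removing an option changes the public signature.
def im2rec_flat(list_file, output_path, resize=None, center_crop=0, quality=95,
                color=-1, encoding='.jpg', inter_method=1, num_workers=1,
                pack_label=False, parts=1):
    pass

# Preferred design: transforms are composed by the user and packing options
# live in a dict, so the signature stays stable as options evolve.
def im2rec(list_file, transforms, dataset_params, output_path):
    pass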

APPENDIX

Dataset Params

These parameters describe how the records will be packed sequentially into the .rec file. A sketch of such a dictionary follows the table below.


Parameter       Default Value/Optional    Description
num_workers     1                         Have multiple workers doing the job. This option will imply shuffling the dataset.
batch_size      4096
pass_through    False
pack_label
parts           1                         Used for part generation; logically split the .lst file into NSPLIT parts by position.
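
For illustration, a dataset_params dictionary populated with the defaults from the table above might look like this (the exact keys are part of the proposal and may change):

# Hypothetical dataset_params dictionary using the defaults listed above.
dataset_params = {
    'num_workers': 1,      # number of worker threads; >1 implies shuffling
    'batch_size': 4096,
    'pass_through': False,
    'pack_label': False,   # default not specified in the table; assumed False here
    'parts': 1,            # logically split the .lst file into N parts by position
}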

Supported Gluon Transforms

These transforms perform pre-processing of the images. Each of them will be implemented as a gluon transform. The proposed API accepts a transforms.Compose object, which is a Sequential block containing the stack of transforms to be applied. To keep this extensible, the user can define a custom HybridBlock and include it in the composition, as sketched below.
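
A minimal sketch of including a user-defined HybridBlock alongside the built-in gluon transforms (ClipToUnitRange is a hypothetical custom transform):

from mxnet.gluon import HybridBlock
from mxnet.gluon.data.vision import transforms

class ClipToUnitRange(HybridBlock):
    """Hypothetical user-defined transform that clips pixel values to [0, 1]."""
    def hybrid_forward(self, F, x):
        return F.clip(x, 0.0, 1.0)

# The custom block is stacked together with the built-in transforms.
custom_transform = transforms.Compose([transforms.Resize(300),
                                       transforms.ToTensor(),
                                       ClipToUnitRange()])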


Parameter       Default Value/Optional    Description
resize          Optional                  Resize the image to the new size [width, height].
center_crop     0                         Whether to crop the center of the image to make it square: 1 - perform cropping, 0 - no cropping.
quality         95 for JPEG; 9 for PNG    JPEG quality for encoding (1-100, default: 95) or PNG compression for encoding (1-9, default: 3).
color           -1                        Force color (1), gray image (0), or keep the source unchanged (-1).
encoding        '.jpg'                    Encoding type. Can be '.jpg' or '.png'.
inter_method    1                         Image interpolation method: NN(0), BILINEAR(1), CUBIC(2), AREA(3), LANCZOS4(4), AUTO(9), RAND(10).




2 Comments

  1. list_filename: does it have to be in a specific format?
  2. We got input from one of MXNet's power users (Ben Taylor from Ziff.ai) about the need to append to an existing RecordIO file as new training data arrives; with the current CLI you have to restart from scratch.

    It would be great if we could add this to the scope or to the next iteration of this work.