1. Problem

Currently, there is no mechanism to verify the consistency and integrity of models trained on older MXNet versions. Any unintended change in an underlying model saving/loading API could break backward compatibility across MXNet versions.

2. Goal

The model backward compatibility checker aims to verify that models trained on earlier versions of MXNet load correctly on the latest version or the latest release candidate. It also performs a sanity check that inference on these trained models gives consistent results.

3. Approach

Here is the proposed approach:

  1. Create simple models on earlier versions of MXNet, initialize them with randomly generated weights, and perform a forward pass on them. Save the model and model parameters and upload them to an S3 bucket.
  2. As a continuation of the previous step, run a simple inference on randomly generated input, and save that input and the inference output alongside the model files on S3.
  3. The inference script, running on the latest master branch of the MXNet repository, pulls the model files and data and tries to load the models back into memory. The tests fail if a model fails to load or produces a different inference output. A different inference output could indicate a change in an underlying operator.
  4. Use the same seed values so that the training and inference scripts run in the same environment.
  5. These tests could run as part of the nightly tests and would help flag the issues mentioned above.
  6. Primarily, the model backward compatibility checker would cover the following APIs for saving/loading models:
    1. Declarative models: load_checkpoint() from the Model API
    2. Gluon models: load_parameters/save_parameters API from the Gluon package
    3. Gluon models: load_params/save_params API from the Gluon package
    4. Hybridized models: import/export API from the Gluon package
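
The save-then-verify flow in steps 1 to 4 can be sketched as follows. This is only a framework-agnostic illustration in plain Python, not the implementation in the pull request: a toy dense layer stands in for an MXNet model, a local temporary directory stands in for the S3 bucket, and every name used (SEED, forward, save_reference_artifacts, check_backward_compatibility, dense_model.json) is hypothetical.

```python
import json
import math
import os
import random
import tempfile

SEED = 7  # hypothetical fixed seed, shared by both scripts


def forward(weights, bias, inputs):
    """Toy stand-in for a model's forward pass: a single dense layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def save_reference_artifacts(artifact_dir, n_in=4, n_out=3):
    """'Older version' side: build the model, run it, save everything."""
    rng = random.Random(SEED)
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)]
               for _ in range(n_out)]
    bias = [rng.uniform(-1, 1) for _ in range(n_out)]
    inputs = [rng.uniform(-1, 1) for _ in range(n_in)]
    record = {
        "params": {"weights": weights, "bias": bias},
        "input": inputs,
        "expected_output": forward(weights, bias, inputs),
    }
    # A local file stands in for the upload to S3.
    path = os.path.join(artifact_dir, "dense_model.json")
    with open(path, "w") as f:
        json.dump(record, f)
    return path


def check_backward_compatibility(path, rel_tol=1e-5, abs_tol=1e-8):
    """'Latest version' side: load, re-run inference, compare outputs."""
    with open(path) as f:
        record = json.load(f)  # an exception here means the load failed
    params = record["params"]
    actual = forward(params["weights"], params["bias"], record["input"])
    expected = record["expected_output"]
    if len(actual) != len(expected):
        return False
    # Compare with a tolerance rather than exact equality.
    return all(math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
               for a, e in zip(actual, expected))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        artifact = save_reference_artifacts(tmp)
        print("compatible:", check_backward_compatibility(artifact))
```

In a real checker, the two functions would live in separate scripts run against different MXNet versions, and the load step would go through the save/load APIs listed in step 6. The tolerance values here are placeholders; suitable thresholds would depend on the operators and hardware involved.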

4. Current work

A first cut of the model backward compatibility checker using the above approach has been implemented (https://github.com/apache/incubator-mxnet/pull/11626). We would like to make it more robust and welcome feedback on ways to improve it further.
