change publish OS (Severe)

Scenario

As we know, Ubuntu 14.04 will soon no longer be supported by Canonical. Although the package servers will stay online, no further patches or upgrades will be released. If we keep publishing from it, the published packages carry a growing security risk, and we would also need more and more workarounds to keep using the public Ubuntu 14.04 repositories.

The expected move is to publish all of the packages from the next LTS release, 16.04. However, when testing a package built there, Sheng found that its GLIBC requirement was not compatible with CentOS 7, failing with the following error:

/lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.23' not found (required by /tmp/mxnet6145590735071079280/libmxnet.so)

GLIBC ships at a fixed version with each system. It cannot easily be upgraded or downgraded, because every package distributed with the system could become unstable. As a result, we may lose CentOS 7 and Amazon Linux support entirely if we decide to go with a 16.04 build. We also cannot statically link GLIBC because of its (L)GPL licensing.
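
To see which GLIBC versions a given binary actually requires, its dynamic symbol table can be inspected; a quick check along these lines (the libmxnet.so path is illustrative) is:

# List every GLIBC symbol version the shared object depends on
objdump -T libmxnet.so | grep -o 'GLIBC_[0-9.]*' | sort -Vu

The highest version printed must not exceed the GLIBC of the oldest target system (2.17 on CentOS 7).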

The following are the GLIBC versions used by the different systems:

14.04:
ubuntu@ip-172-31-19-57:~$ ldd --version
ldd (Ubuntu EGLIBC 2.19-0ubuntu6.14) 2.19


16.04:
ubuntu@ip-172-31-37-210:~$ ldd --version
ldd (Ubuntu GLIBC 2.23-0ubuntu10) 2.23


CentOS 7:
[centos@ip-172-31-13-196 ~]$ ldd --version
ldd (GNU libc) 2.17

Proposed Solution

To solve this issue, I propose the following options:

Build with different GLIBC (Not tested)
Per https://www.tldp.org/HOWTO/Glibc2-HOWTO-6.html, it is worth configuring a dedicated GLIBC that all builds are based on. This could be the ideal solution, since we could stay on an up-to-date system while remaining compatible with all previous versions.
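
An untested sketch of the mechanics from the linked HOWTO: build the desired GLIBC into its own prefix and point the link step at it. The version and paths are illustrative, and producing binaries that stay portable to machines without this prefix needs additional care:

# Build an older GLIBC into a dedicated prefix (glibc must build out-of-tree)
wget https://ftp.gnu.org/gnu/glibc/glibc-2.17.tar.gz
tar xf glibc-2.17.tar.gz
mkdir glibc-build && cd glibc-build
../glibc-2.17/configure --prefix=/opt/glibc-2.17
make -j"$(nproc)" && sudo make install

# Link a test program against that GLIBC instead of the system one
gcc example.c -o example \
  -Wl,--rpath=/opt/glibc-2.17/lib \
  -Wl,--dynamic-linker=/opt/glibc-2.17/lib/ld-linux-x86-64.so.2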

Still using 14.04
As mentioned above, we can keep using 14.04 even after its support life-cycle ends. Adding the archive repository to the system keeps apt-get install working (see the sketch below). The safer way is a Docker image that contains all of the configuration, so that building the whole package no longer requires apt-get install at all. Moreover, 14.04 should not be used for the publish step itself, given the potential security problems; a system that is still before End of Life should handle publishing. In our case, only the backend build carries the GLIBC version requirement, so the PyPI and Maven publishes can be kept off 14.04.
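
For the archive repository, the usual approach (an assumption to verify against Canonical's EOL notes) is to repoint apt at old-releases.ubuntu.com once 14.04 leaves support:

# Redirect apt to the EOL archive so apt-get install keeps working
sudo sed -i \
  -e 's|archive.ubuntu.com|old-releases.ubuntu.com|g' \
  -e 's|security.ubuntu.com|old-releases.ubuntu.com|g' \
  /etc/apt/sources.list
sudo apt-get update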

Using CentOS 7
Since we still need to maintain support for CentOS and Amazon Linux, the best solution is to choose an OS that is itself still supported; CentOS 7 is the best candidate to migrate our build scripts to. However, all of the current GPU build scripts would become unavailable, since NVIDIA does not provide the corresponding packages as RPMs. We would therefore need to use NVIDIA Docker for CentOS 7 (see the sketch after the list below), which only provides a limited set of CUDA versions. Another problem we may see is a performance and stability difference in the backend we build, since this downgrades GLIBC from 2.19 to 2.17.

CUDA versions that NVIDIA supports for CentOS 7:
CUDA 10, 9.2, 9.1, 9.0, 8.0, 7.5
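
A sketch of what a CentOS 7 GPU build environment could look like under NVIDIA Docker (the image tag here is illustrative; the actual set of available tags should be checked on Docker Hub):

# Pull an official CUDA devel image for CentOS 7 and start a build shell
docker pull nvidia/cuda:9.2-cudnn7-devel-centos7
docker run --runtime=nvidia --rm -it nvidia/cuda:9.2-cudnn7-devel-centos7 bash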

Drop the support for CentOS 7 and Amazon Linux and keep the 16.04 build
We would still provide build-from-source instructions for users on these two systems.

gcc/gfortran version upgrade (Important)

Scenario

Currently, we build all of our dependencies with GCC 4.8 in order to stay compatible with the different CUDA versions. However, some newer components, such as Horovod, require GCC 5.0 or above to build against. We need to make these compatible, and there may be unforeseen problems such as backward-compatibility or stability issues.

Solution

We simply upgrade our GCC version from 4.8 to 5.x to make them compatible (see the sketch below).
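
For reference, a sketch of how GCC 5 could be installed on the two candidate build systems (the PPA and the exact devtoolset number are assumptions to verify):

# Ubuntu, via the toolchain PPA:
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install gcc-5 g++-5 gfortran-5

# CentOS 7, via Software Collections (devtoolset-4 ships gcc 5.3):
sudo yum install centos-release-scl
sudo yum install devtoolset-4-gcc devtoolset-4-gcc-c++ devtoolset-4-gcc-gfortran
scl enable devtoolset-4 bash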

static library version control (Improvement)

Scenario

As Frank Liu discovered in MXNet build dependencies, we are facing issues with the static libraries. The versions chosen there are questionable and cannot be easily maintained. Apart from that, some dependencies, such as libzmq, carry a GPL-style license that Apache Legal forbids us from using. We therefore need to find an alternative way to build these dependencies.

For example, we are currently using a beta version of libjpeg-turbo, and a non-stable OpenBLAS that should be moved back to a stable release. We should choose stable releases for all of them for the best performance, and we need to dig in and document the reasons behind our choice of each package version.
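
Whatever versions are chosen, they should at least be pinned explicitly and verified at download time, along these lines (the version number and digest below are placeholders, not vetted choices):

# Pin one dependency version and verify its checksum before building
OPENBLAS_VERSION=0.3.3
curl -fsSLO "https://github.com/xianyi/OpenBLAS/archive/v${OPENBLAS_VERSION}.tar.gz"
echo "<expected-sha256>  v${OPENBLAS_VERSION}.tar.gz" | sha256sum -c -
tar xf "v${OPENBLAS_VERSION}.tar.gz"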

Solution

There is no ideal way to automate this process; it requires manual checking and benchmarking to choose the best-performing set. We also need to get rid of the libzmq usage, or consult the legal team about possible alternatives; that change has to be made on the PS-LITE side.

Number of packages supported (Good to have)

Scenario

We are currently the 'beast' on PyPI: together with TensorFlow we take up over 40% of the total package storage. This is due to the support matrix of our packages: we offer a set of CUDA versions, each combined with MKL variants and multiple Python versions (a rough count is sketched below). It is a trade-off between wide version support and a maintenance nightmare, and there is no clear answer yet on whether we should reduce the number of packages we publish or keep it as it is.
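
For a rough sense of the scale (the exact CUDA set here is an assumption, and Python versions multiply the count again):

# Count variants: one CPU flavor plus six CUDA flavors, each with/without MKL
count=0
for cu in cpu cu75 cu80 cu90 cu91 cu92 cu100; do
  for mkl in plain mkl; do
    count=$((count + 1))
  done
done
echo "${count} pip package variants per release"   # 7 x 2 = 14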

Solution

We bump the package matrix in step with CUDA releases.


1 Comment

  1. Great summary Qing.

    • Agree with the approach on libc, we should provide some compilation path such that we use as old a version as practical for portability purposes.  
    • Compiling on CentOS7 might be a good option.
    • Moving to gcc5 shouldn't cause any issues I'm aware of (with nvcc for example).
    • Also agree that GPL prohibits static linking, so if we're doing that we'll have to stop.

    I have two questions relating to automated deployments:

    • I've also gotten feedback from PyPI that we're using a disproportionate amount of resources on their service, and we should of course be respectful of their system. I'm wondering if we're pushing packages daily to PyPI, or do we only push packages during a release? If we're pushing packages daily, can we move to just pushing packages during a release? What would the implications be? Also, can we begin garbage-collecting old releases in order to use less resources? Do we do this already?
    • Is there a need to tightly couple our CI system with the actual artifacts generated and pushed?  Are we creating the artifacts we push to users with a service we give all internet users access to?  I understand that we'd want to ensure these artifacts are built in CI so that we don't accidentally regress and break the package creation flow, but maybe we could have a separate system that creates and pushes the artifacts (after pulling from a release branch)?

    Link to feedback I've received from PyPI: https://github.com/pypa/warehouse/issues/4686