STATUS: Draft, WIP


This document captures requirements for making AI actions a first-class citizen in Openwhisk. The main ideas captured bellow are inspired from the work done in running deep learning inferences on images and videos.

TL;DR

  • A new action kind ( python3-ai ) that contains most frameworks used for ML/DL ( i.e. scipy, tensorflow, caffe, pytorch, and others )
  • GPU support
  • tmpfs for sequences and compositions / "shared memory FaaS" ( Carlos Santana) / "serverless datastore" ( Rodric Rabbah )
  • Higher memory limits ( 4G+)
  • Longer execution time

New action kind - "python3-ai"

Problem

ML/DL frameworks are large in size. With the existing python3 action kind developers have to create large ZIP packages ( 1GB+) that hit the max action size limitations.

Workarounds

Blackbox actions are already supported by OpenWhisk, but given the size of the docker images which are in GBs, cold-start times are highly impacted. Downloading large docker images may add latencies up to minutes, leaving a large footprint on the used network, and increasing the space used by docker images on each host.  

Proposal

Create a new action kind: python3-ai 

Include the following frameworks and libs: scipy, tensorflow, pytorch, caffee2, and others that the community may ask for. See Google ML Engine  version as a more complete reference of libraries.

The image size would be ~6GB, and ideally it would be prewarmed in the OW cluster by running docker pull ... commands on each host. 

"nodejs-ai" should also be considered given the Tensorflow JS library.


GPU support

Problem

Processing time is greatly reduced when using GPU.

Workarounds

Use CPU and parallelize.

Parallelization is actually the sweet spot that makes AI actions perform better than any other linear model, even when compared to a linear GPU processing. For example compare 500 parallel actions running on CPU that process a data input in 10 seconds each, vs 1 GPU action performing the same 500 operations sequentially, with each operation taking 1s. The GPU action produces a response in 500s, while the 500 parallel actions produce the response in 10s. 

Proposal

Allow developers to specify GPU resources when creating actions:

wsk action create my-action code.py --gpu [number_of_gpus]

Depending on docker support for GPU, developers should be able to specify how many GPUs the action needs. 


tmpfs for sequences and compositions 

This seems to be a long standing demand in the community as well:

Problem

Serverless application decomposition encourages the split of tasks into more granular functions. This is the source of 2 categories of problems: 

Problem 1 - Transferring large assets between actions

It's hard to pipe data between actions, when its size is over the system limits.

Problem 2 - Multiple actions processing the same asset

In case multiple actions are processing the same asset, each action needs to independently re-download the same asset, resulting in unnecessary network traffic.

Workarounds

Send Asset by Reference

Developers can use a 3rd party blob storage such as AWS S3 to upload the asset, and pass it to the next action using the URL of the asset, as "the reference" to it.

The problem with this workaround is that the size of the asset influences the performance. Having each action uploading and downloading to/from a blob storage for each activation increases the execution time. The bigger the asset, the bigger the impact.

Developer experience is also impacted by the fact that developers must manage the integration with a 3rd party blob storage. This implies that:

  • developers are responsible to perform regular credential rotation for the blob storage, 
  • developers must manage multi-region buckets, corresponding to the regions the actions may be invoked in, to optimize the latencies
Combine multiple actions in one action

Developers can combine the operations that act on the same asset into a single action. This workaround makes it harder to reuse actions in sequences or compositions.

It also restricts developers from using a polyglot implementation with action written in multiple languages; for instance the AI actions could be written in python, and combined with JS actions that may download the asset at the beginning of the sequence, and upload it at the end, by the last action in the composition.

Compress Asset and include it in the payload

If the asset format can be further compressed, it could be sent as Base64 into the JSON payload for the next action. 

This workaround it still constrained by configurable system limits, and it makes the action code more complex than it needs to be.

Measuring the impact

The problem described above results in:

  • Poor developer experience
  • Poor execution performance

In order to asses the impact we need a methodology to quantify how big this problem is: 

Proposal to measure the impact

Measuring execution performance 

Questions:

  1. How much time is spent in moving the content around vs processing it ?
  2. What's the impact of using a Blob Storage when compared to an NFS/EFS-like systems ?
Measuring developer impact
  1. Maintenance overhead 
  2. Implementation time 
  3. Additional code complexity 

Proposal

The proposal is to transparently provide developers with a temporary way to store large assets in the OW cluster. This is probably the hardest problem to solve when compared to the other ones b/c it involves persistence, state, and possibly handling large amounts of data. Bellow are listed a few possible options:

Namespace volumes

(credits: Cosmin Stanciu for suggesting this simpler solution)

Namespaces can be provided with their own volume, one volume per namespace, limited in size ( i.e 10GB). The programming model should handle this volume as short-lived, hence its limited size as well. Its purpose is to make it efficient for sequences and compositions to access assets faster than if they were downloaded from an eventually consistent blob storage. The volume is independent of activations lifecycle. It's up to the developers to manage it, including cleaning it up.

OpenWhisk's implementation could configure a network file system like NFS, (i.e. AWS EFS), GlusterFS, or others. This volume is mounted on each Invoker host. Each namespace has its own subfolder on that volume, and it can be mounted appropriately into a known path in the action containers that want to make use of it. To simplify the stem-cells management, each action requiring access to such volume may ignore stem-cells; instead, they may go through a cold-start, as if they're blackbox containers.

The diagram bellow shows 2 Invoker Hosts that mount a network file system on /mnt/actions on the host. Inside the action containers the /tmp the folder is mounted from the host folder corresponding to the namespace: /mnt/actions/<NAMESPACE>

Action volumes

Allow developers to "attach" a persistent disk to an action. The programming model in this case assumes there's always a folder available on the disk on a well known path defined by the developer. OpenWhisk's implementation could leverage solutions particular to a deployment:

Volumes could be local to a single host, or backed by a distributed FS, or blob storage.

  • Local disk 
    • PROs: The most performant option for a FaaS model because activations are usually short, volumes can be destroyed when a sequence finished, and network is not used.   
    • CONs: Limited to the max number of containers that can be executed in parallel on a host. 
  • Distributed FS
    • PROs: sequences can be long (many actions), actions can run on any host. 
    • CONs: dependent on the network speed, and network congestion. 

The WSK CLI could look like in the example bellow:

wsk action create read-asset read.js --volume my-action-volume
wsk action create action-1 foo.py --volume my-action-volume
wsk action create action-stateless code.js
wsk action create action-2 bar.py --volume my-action-volume
wsk action create write-asset write.js --volume my-action-volume
wsk action create my-sequence --sequence read-asset,action-1,action-stateless,action-2,write-asset

When running the sequence my-sequence, the Openwhisk scheduler can use the volume information defined by each action, to schedule the actions requiring the same volume on the same host. When the read-asset action starts, it creates a new volume that should be uniquely identified. For instance the volume name, the namespace of the subject invoking, and a UUID, could be used.  I.e. my-action-volume--guest--<UUID>.   When action-1 starts, it mounts the same volume created previously, same for action-2 and write-asset actions. Note that the sequence may have other actions that don't need the volume such as action-stateless; in such cases the actions can be scheduled anywhere in the cluster, as per the existing scheduling mechanism. 

The volume my-action-volume can be mounted inside each container on: /mnt/ow/my-action-volume. 

Activation result should return the volume info so that the next action in the sequence runs on the host where the volume was created.

The volume should be removed once the prewarmed actions are destroyed. 

Open questions:

Should the developers specify a size for the volume, or assume a default size i.e. 10GB which is configured at the system level  ? 

TODO: Describe congestion scenarios: (1) wait-time due to a prewarmed action in the sequence processing another request (2) lack of resources to cold start a new container on the same host. 

Cluster cache

Provide developers with a caching solution in the cluster. Developers would still pass larger assets by reference between actions, they would still write code to upload or download an asset, but use a cache provided by OpenWhisk, inside the cluster. The cache can be exposed through the Openwhisk SDK using 2 methods: write(asset_name, asset_value), read(asset_name).

The implementation could use a distributed in-memory cache, a distributed FS, or a blob-storage. 

The problem with this approach is the network bottleneck; even if the action ends up coincidentally on the same host with other actions in the sequence, it would still consume network bandwidth to write or read an asset from the cluster cache, hence it's dependent on the network performance.

Streaming responses between actions

In this option there's no persistence involved, but actions may communicate directly, or through a proxy capable to stream the response.

  • Direct action-to-action communication. The most performant option would be to allow actions to communicate directly, and ideally schedule them on the same host. The problem introduced by this approach is the network isolation. Usually OpenWhisk deployments create a network perimeter around each action, preventing it to poke other actions in the cluster. For action-to-action communication to be opened, each namespace should probably have its own network perimeter.
  • Action-to-proxy-to-action. This is similar to how Kafka is being used currently. But to workaround kafka payload limits, another proxy could be used. The communication should be secured in this case as well, so that only the allowed actions can communicate.      

Higher memory limits

Problem

Some DL/ML algorithms need more memory to process a data input. Existing limits are too low.

Workarounds

Configure Openwhisk deployments with higher memory limits. 

Proposal

TBD:  allocation implications in Scheduler / Load Balancer. 

Longer execution time 

Problem

During a clod start, an AI action may have to do a one-time initialization, such as downloading a model from the model registry. 

Or it may take longer for some actions to perform an operation

Workarounds

Configure OpenWhisk deployment with higher execution time settings ( CONFIG_whisk_timeLimit_max )

Proposal

TBD: side effects for increasing the timeout.