F.A.Q / Troubleshooting - Distributed Training

# Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
#   Licensed under the Apache License, Version 2.0 (the "License").
#   You may not use this file except in compliance with the License.
#   A copy of the License is located at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
#   or in the "license" file accompanying this file. This file is distributed
#   on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
#   express or implied. See the License for the specific language governing
#   permissions and limitations under the License.
# ==============================================================================

The following is a list of frequent problems and troubleshooting steps for running distributed training with Horovod and executing MXFusion's code on GPU.

ValueError while executing horovodrun

Problem

After a fresh installation of Horovod on the machine, the following error may occur when executing code with horovodrun in the terminal:

ValueError: Neither MPI nor Gloo support has been built. Try reinstalling Horovod ensuring that either MPI is installed (MPI) or CMake is installed (Gloo).

Steps to Reproduce

After installing Horovod with pip install horovod==0.16.4, execute an MXFusion distributed training script with horovodrun -np {number_of_processors} -H localhost:4 python {python_script}

Solution

Use mpirun instead of horovodrun. For example, in the terminal, type:

mpirun -np {number_of_processors} -H localhost:4 python {python_script}
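For instance, a run with 4 processes on the local machine looks like the following (train_bnn.py is a hypothetical script name used for illustration):

mpirun -np 4 -H localhost:4 python train_bnn.py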

Warning of CMA Support Not Available

Problem

Every time Ubuntu boots, its ptrace protection is re-enabled; when MXFusion is then executed with Horovod, the protection blocks CMA support from being enabled, which prevents shared memory between processes. A warning will be shown in the terminal:

Linux kernel CMA support was requested via the btl_vader_single_copy_mechanism MCA variable, but CMA support is not available due to restrictive ptrace settings.

Steps to Reproduce

After Ubuntu boots, execute an MXFusion distributed training script with mpirun -np {number_of_processors} -H localhost:4 python {python_script}

Solution

Temporarily disable ptrace protection by typing the line below in the terminal. Note that, as a security measure, you may want to re-enable it with echo 1 after you have stopped using Horovod. Also note that ptrace_scope is reset to 1 every time Ubuntu boots. To disable ptrace protection, in the terminal type:

echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
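To re-enable the protection once you have finished using Horovod, type:

echo 1 | sudo tee /proc/sys/kernel/yama/ptrace_scope

Alternatively, since the warning names the btl_vader_single_copy_mechanism MCA variable, you can leave ptrace protection untouched and disable the single-copy mechanism for one run instead, at some cost to shared-memory performance:

mpirun --mca btl_vader_single_copy_mechanism none -np {number_of_processors} -H localhost:4 python {python_script}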

Segmentation fault : 11 with MXNet-cu100

Problem

When executing MXFusion on GPU, a Segmentation fault : 11 error is thrown if MXNet-cu100 is installed.

Steps to Reproduce

Install MXNet-cu100 with pip install mxnet-cu100 on Deep Learning AMI (Ubuntu) Version 24.1 (ami-06f483a626f873983). Run an MXFusion distributed training script with mpirun -np {number_of_processors} -H localhost:4 python {python_script}.

Solution

Uninstall MXNet-cu100 and install MXNet-cu100mkl instead. In the terminal, type:

pip uninstall mxnet-cu100
pip install mxnet-cu100mkl
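To verify that the replacement build is installed and imports correctly, a quick check (standard pip/Python usage, not MXFusion-specific) is:

pip show mxnet-cu100mkl
python -c "import mxnet; print(mxnet.__version__)"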

Segmentation fault : 11 with latest version of Horovod

Problem

MXFusion currently does not support Horovod version 0.18 and above. With the latest version of Horovod, running MXFusion distributed training on CPU produces inaccurate loss values and outputs that are inconsistent between processes. Running MXFusion distributed training on GPU throws a Segmentation fault : 11 error.

Steps to Reproduce

Install Horovod with pip install horovod. Run a distributed training script with mpirun -np {number_of_processors} -H localhost:4 python {python_script}.

Solution

Currently, MXFusion supports Horovod versions below 0.18. Install the latest Horovod release before version 0.18 with:

pip install horovod==0.16.4
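To confirm which Horovod version is currently installed, type:

pip show horovod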

Error with dtype='float64' on GPU

Problem

When setting float64 as the data type and running the script on GPU, this error may occur:

mxnet.base.MXNetError: src/ndarray/ndarray_function.cu:58: Check failed: to->type_flag_ == from.type_flag_ (1 vs. 0) : Source and target must have the same data type when copying across devices.

Steps to Reproduce

On a GPU machine, change the value of config.DEFAULT_DTYPE and the dtype of the NDArray to 'float64' in distributed_bnn_test.py, then run the test. The error will occur in test_BNN_regression and test_BNN_regression_minibatch. In the terminal, from the MXFusion source root folder, type:

cd testing/inference
mpirun -np 4 -H localhost:4 pytest -s distributed_bnn_test.py

Solution

Set float32 as the data type. GPUs also process float32 at better speed than float64.
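As a minimal sketch of keeping everything in float32 (assuming MXFusion's default dtype lives in mxfusion.common.config, as the config.DEFAULT_DTYPE reference above suggests):

# Keep MXFusion and MXNet arrays in float32 so that cross-device
# copies on GPU use a single, matching dtype.
from mxfusion.common import config
import mxnet as mx

config.DEFAULT_DTYPE = 'float32'
x = mx.nd.array([1.0, 2.0, 3.0], dtype='float32', ctx=mx.gpu(0))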