Machine Learning Lab

Check CPU workload:

We have a simple script for checking the workload of all machines: /cs/home/hj/bin/available_computers.pl. Every time you submit a new job, please run it first to look for a free or lightly loaded machine. As a rule of thumb, the load on a machine should not exceed its number of cores; for a 6-core machine, keep the load at or below 6.
Run the Linux 'htop' command to check the CPU load and memory usage on each machine.
If your program consumes a lot of memory (over 10 GB), DON’T submit it more than once to a single machine.
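
The rule of thumb above (load should not exceed the core count) can be sketched in Python for the machine you are logged in to. This is only an illustration, not part of the lab scripts: the function names is_overloaded and local_status are made up here, and it assumes the 1-minute load average is the number being compared.

```python
import os


def is_overloaded(load_1min, n_cores):
    """Return True when the load average is above the number of cores."""
    return load_1min > n_cores


def local_status():
    """Report load, core count, and the overload verdict for this machine."""
    load_1min, _, _ = os.getloadavg()  # Unix only: 1-, 5-, 15-minute averages
    n_cores = os.cpu_count() or 1      # cpu_count() may return None
    return load_1min, n_cores, is_overloaded(load_1min, n_cores)
```

Calling local_status() on a 6-core machine with a load of 7.2 would report it as overloaded, so another machine is the better choice for a new job.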

Check GPU workload:

We have a simple script for checking the GPU workload of all machines: /cs/home/hj/bin/AllGPUStat.sh.
To check a single GPU-equipped server, retrieve the GPU summary with “nvidia-smi”. As long as the free GPU memory meets your program’s needs, the program can run. However, it may make little or no progress if the GPU utilization is already high. If two programs are running on the same GPU and one of them allocates too much memory, BOTH programs crash. “nvidia-smi” is not available on OSX.
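
As a rough sketch of the check described above, the script below asks nvidia-smi for per-GPU memory and utilization in CSV form and lists the GPUs with enough free memory. The helper names (parse_gpu_csv, query_gpus, gpus_with_room) are hypothetical, and the parser assumes every field comes back as a plain number, which may not hold on every driver or GPU model.

```python
import subprocess

# Fields requested from nvidia-smi; memory values are reported in MiB.
QUERY = "index,memory.used,memory.total,utilization.gpu"


def parse_gpu_csv(text):
    """Turn 'index, used, total, util' CSV lines into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        index, used, total, util = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "free_mib": int(total) - int(used),
            "util_pct": int(util),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi on this machine and parse its CSV output."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


def gpus_with_room(gpus, need_mib):
    """Return the indices of GPUs whose free memory covers need_mib."""
    return [g["index"] for g in gpus if g["free_mib"] >= need_mib]
```

For example, gpus_with_room(query_gpus(), 4000) lists the GPUs on the current server that have at least 4000 MiB free. Free memory alone does not guarantee good throughput, so it is worth looking at util_pct as well.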

In most machine learning frameworks, the first GPU is picked by default. TensorFlow, for example, will pre-allocate a chunk of memory on EVERY SINGLE GPU unless you explicitly mask the ones you don’t need. Masking can be done with, for example, “setenv CUDA_VISIBLE_DEVICES 1” (csh/tcsh syntax; in bash, use “export CUDA_VISIBLE_DEVICES=1”) if you only want to expose the second GPU (GPU indices start at 0).
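
The same masking can also be done from inside a Python program, as long as the variable is set before the framework is imported. The sketch below is one possible way to do it; mask_gpus is a hypothetical helper name, not a function from any framework.

```python
import os


def mask_gpus(visible):
    """Expose only the listed GPU indices to CUDA code started by this process."""
    value = ",".join(str(i) for i in visible)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Expose only the second GPU (index 1), then import the framework.
mask_gpus([1])
# import tensorflow as tf  # must come AFTER the mask is set
```

Note that inside the masked process the exposed GPU is renumbered, so the second physical GPU appears to the framework as device 0.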