Check CPU workload:
- We have a simple script that checks the workload of all machines. You may run: /cs/home/hj/bin/ Each time you submit a new job, please use this command to find a free or lightly loaded machine. On a 6-core machine, the workload should normally not exceed 6.
- Run the Linux 'htop' command to check the CPU load and memory usage on each machine.
- If your program consumes a lot of memory (over 10 GB), DON’T submit it more than once to a single machine.
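The "load should not exceed the core count" rule above can be sketched in Python. This is only an illustration for the local machine, not the script mentioned above: the helper names `has_room` and `describe_local_load` are made up here, and it assumes each new job keeps roughly one core busy.

```python
import os


def has_room(load, cores, new_jobs=1):
    """Return True if adding new_jobs keeps the load at or below the core count.

    Assumption: each job occupies about one core.
    """
    return load + new_jobs <= cores


def describe_local_load():
    """Summarize the 1-minute load average of the machine this runs on."""
    load_1min, _, _ = os.getloadavg()   # 1-, 5-, and 15-minute load averages
    cores = os.cpu_count() or 1         # cpu_count() may return None
    status = "ok to submit" if has_room(load_1min, cores) else "busy, try another machine"
    return f"load {load_1min:.2f} on {cores} cores: {status}"


if __name__ == "__main__":
    print(describe_local_load())
```

For example, `has_room(4.5, 6)` is true, while `has_room(5.5, 6)` is false because one more single-core job would push the load past 6.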
Check GPU workload:
- We have a simple script that checks the workload of all machines. You may run: /cs/home/hj/bin/
- To check a single server equipped with GPUs, retrieve the GPU summary with “nvidia-smi”. As long as the remaining GPU memory covers what your program needs, the program can run. However, it may make little progress if the GPU utilization is already high. If 2 programs are executing on the same GPU and one of them allocates too much memory, BOTH programs crash. “nvidia-smi” is not available on OS X.
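The manual check above (enough free memory, utilization not too high) could be automated roughly as follows. This is a sketch under assumptions: it relies on the CSV query mode of “nvidia-smi”, the function names are hypothetical, and the 50% utilization cutoff is an arbitrary illustrative threshold, not a site policy.

```python
import subprocess

# Fields requested from nvidia-smi; memory values are reported in MiB.
QUERY = "index,memory.used,memory.total,utilization.gpu"


def parse_gpu_csv(text):
    """Parse 'index, used, total, util' lines into a list of dicts.

    Assumes every field is numeric; fields reported as [N/A] are not handled.
    """
    gpus = []
    for line in text.strip().splitlines():
        index, used, total, util = (int(field.strip()) for field in line.split(","))
        gpus.append({"index": index, "free_mib": total - used, "util_pct": util})
    return gpus


def query_gpus():
    """Run nvidia-smi on the local machine and return the parsed GPU list."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


def candidates(gpus, needed_mib, max_util_pct=50):
    """Return indices of GPUs with enough free memory and moderate utilization."""
    return [
        g["index"]
        for g in gpus
        if g["free_mib"] >= needed_mib and g["util_pct"] <= max_util_pct
    ]
```

On a GPU server, `candidates(query_gpus(), needed_mib=4000)` would list the GPU indices that currently have at least 4000 MiB free and are at most 50% utilized.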
In most machine learning frameworks, the first GPU is picked by default. TensorFlow, for example, will pre-allocate a chunk of memory on EVERY SINGLE GPU unless you explicitly mask the ones you don’t need. Masking can be done with, for example, “setenv CUDA_VISIBLE_DEVICES 1” (csh/tcsh syntax; in bash, “export CUDA_VISIBLE_DEVICES=1”) if you only want to expose the second GPU (GPUs are 0-indexed).
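The same masking can also be done from inside a Python program, provided the variable is set before the framework is imported. A minimal sketch, where `mask_gpus` is a hypothetical helper name:

```python
import os


def mask_gpus(visible):
    """Expose only the listed physical GPU indices to CUDA programs in this process."""
    value = ",".join(str(i) for i in visible)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Expose only the second physical GPU (index 1).
mask_gpus([1])

# Import the framework only after masking, for example:
# import tensorflow as tf
```

Inside the process, the single exposed GPU is then renumbered and appears as device 0.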
check_workload_cpu_gpu.1499717104.txt.gz · Last modified: 2017/07/10 20:05 by hj