cloud-computing cloud-management cost-optimization deep-learning distributed-training gpu hyperparameter-tuning job-queue job-scheduler llm-serving llm-training machine-learning ml-infrastructure mlops ml-platform multicloud slurm spot-instances tpu