Open Issues Need Help
[Bug]: Checkpoint save crashes with TypeError (affects all instance types) on 3.test_cases/pytorch/FSDP about 2 hours ago
bug, good first issue
Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
Shell
#aws #awsbatch #distributed-training #efa #eks #generative-ai #gpu #hyperpod #llm-training #parallelcluster