From 59d71a33edda8568b9be2d462dd4105393371016 Mon Sep 17 00:00:00 2001
From: dania-sami <147648301+dania-sami@users.noreply.github.com>
Date: Tue, 9 Sep 2025 14:52:12 +0200
Subject: [PATCH 01/16] Add Week 4 Scientific Paper Proposal (Oscar Arbman & Dania Sami)

---
 .../week4/gpu-utilization-dsami.md | 30 +++++++++++++++++++
 1 file changed, 30 insertions(+)
 create mode 100644 contributions/scientific-paper/week4/gpu-utilization-dsami.md

diff --git a/contributions/scientific-paper/week4/gpu-utilization-dsami.md b/contributions/scientific-paper/week4/gpu-utilization-dsami.md
new file mode 100644
index 0000000000..ca841e2ecd
--- /dev/null
+++ b/contributions/scientific-paper/week4/gpu-utilization-dsami.md
@@ -0,0 +1,30 @@
+# Assignment Proposal
+
+## Title
+An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)
+
+## Names and KTH ID
+- Oscar Arbman (oarbman@kth.se)
+- Dania Sami (dsami@kth.se)
+
+## Deadline
+- Week 4
+
+## Topic
+- MLOps / AIOps / LLMOps
+
+## Category
+- Scientific paper
+
+## Description
+This paper presents an empirical study on GPU underutilization in deep learning training jobs, a critical challenge for both MLOps and large-scale ML systems. The authors analyze more than 400 real-world jobs, of which nearly 50% exhibited GPU utilization ≤50%. They uncover 706 issues related to data pipelines, model configurations, and resource scheduling.
+
+The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to **7.5× performance improvements**. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient.
+
+**Link to article:**
+[An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)](https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/)
+
+## Relevance
+This paper is highly relevant to MLOps/AIOps/LLMOps, since efficient GPU usage is central to deploying ML systems in production. Low GPU utilization directly affects training time, operational cost, and scalability of ML workloads.
+
+By analyzing real-world operational challenges and offering a framework for diagnosing and fixing inefficiencies, this work provides practical insights for DevOps engineers and ML practitioners. It connects directly to the course theme of DevOps for ML systems, emphasizing resource management, pipeline reliability, and system-level optimization.

From 930ffb3ef199d40db2d73a2412428e4a62517e20 Mon Sep 17 00:00:00 2001
From: dania-sami <147648301+dania-sami@users.noreply.github.com>
Date: Tue, 9 Sep 2025 15:11:31 +0200
Subject: [PATCH 02/16] Rename file to oarbman-dsami.md (Week 4 Scientific Paper Proposal)

---
 .../week4/{gpu-utilization-dsami.md => oarbman-dsami.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename contributions/scientific-paper/week4/{gpu-utilization-dsami.md => oarbman-dsami.md} (100%)

diff --git a/contributions/scientific-paper/week4/gpu-utilization-dsami.md b/contributions/scientific-paper/week4/oarbman-dsami.md
similarity index 100%
rename from contributions/scientific-paper/week4/gpu-utilization-dsami.md
rename to contributions/scientific-paper/week4/oarbman-dsami.md

From 4cb8f4f14a0eaee494969df7b70116154bcf4280 Mon Sep 17 00:00:00 2001
From: dania-sami <147648301+dania-sami@users.noreply.github.com>
Date: Tue, 9 Sep 2025 15:31:32 +0200
Subject: [PATCH 03/16] Create README.md

---
 .../week4/oarbman-dsami/README.md | 30 +++++++++++++++++++
 1 file changed, 30 insertions(+)
 create mode 100644 contributions/scientific-paper/week4/oarbman-dsami/README.md

diff --git a/contributions/scientific-paper/week4/oarbman-dsami/README.md b/contributions/scientific-paper/week4/oarbman-dsami/README.md
new file mode 100644
index 0000000000..ca841e2ecd
--- /dev/null
+++ b/contributions/scientific-paper/week4/oarbman-dsami/README.md
@@ -0,0 +1,30 @@
+# Assignment Proposal
+
+## Title
+An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)
+
+## Names and KTH ID
+- Oscar Arbman (oarbman@kth.se)
+- Dania Sami (dsami@kth.se)
+
+## Deadline
+- Week 4
+
+## Topic
+- MLOps / AIOps / LLMOps
+
+## Category
+- Scientific paper
+
+## Description
+This paper presents an empirical study on GPU underutilization in deep learning training jobs, a critical challenge for both MLOps and large-scale ML systems. The authors analyze more than 400 real-world jobs, of which nearly 50% exhibited GPU utilization ≤50%. They uncover 706 issues related to data pipelines, model configurations, and resource scheduling.
+
+The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to **7.5× performance improvements**. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient.
+
+**Link to article:**
+[An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)](https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/)
+
+## Relevance
+This paper is highly relevant to MLOps/AIOps/LLMOps, since efficient GPU usage is central to deploying ML systems in production. Low GPU utilization directly affects training time, operational cost, and scalability of ML workloads.
+
+By analyzing real-world operational challenges and offering a framework for diagnosing and fixing inefficiencies, this work provides practical insights for DevOps engineers and ML practitioners. It connects directly to the course theme of DevOps for ML systems, emphasizing resource management, pipeline reliability, and system-level optimization.

From a9776b2255a31a4e369e46d3ef5dc076f0cb85c1 Mon Sep 17 00:00:00 2001
From: dania-sami <147648301+dania-sami@users.noreply.github.com>
Date: Tue, 9 Sep 2025 15:44:27 +0200
Subject: [PATCH 04/16] Update oarbman-dsami.md

---
 contributions/scientific-paper/week4/oarbman-dsami.md | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/contributions/scientific-paper/week4/oarbman-dsami.md b/contributions/scientific-paper/week4/oarbman-dsami.md
index ca841e2ecd..2da45b841e 100644
--- a/contributions/scientific-paper/week4/oarbman-dsami.md
+++ b/contributions/scientific-paper/week4/oarbman-dsami.md
@@ -10,8 +10,6 @@ An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)
 ## Deadline
 - Week 4
 
-## Topic
-- MLOps / AIOps / LLMOps
 
 ## Category
 - Scientific paper
@@ -24,7 +22,3 @@ The study shows that relatively simple fixes such as improving I/O pipelines or
 **Link to article:**
 [An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)](https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/)
 
-## Relevance
-This paper is highly relevant to MLOps/AIOps/LLMOps, since efficient GPU usage is central to deploying ML systems in production. Low GPU utilization directly affects training time, operational cost, and scalability of ML workloads.
-
-By analyzing real-world operational challenges and offering a framework for diagnosing and fixing inefficiencies, this work provides practical insights for DevOps engineers and ML practitioners. It connects directly to the course theme of DevOps for ML systems, emphasizing resource management, pipeline reliability, and system-level optimization.

From cbed3f01c796b7db5200f43644943dfc294da1aa Mon Sep 17 00:00:00 2001
From: dania-sami <147648301+dania-sami@users.noreply.github.com>
Date: Tue, 9 Sep 2025 15:57:15 +0200
Subject: [PATCH 05/16] Fix formatting of Week 4 proposal (remove Topic, match required headings)

---
 .../scientific-paper/week4/oarbman-dsami.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/contributions/scientific-paper/week4/oarbman-dsami.md b/contributions/scientific-paper/week4/oarbman-dsami.md
index 2da45b841e..b0db510a40 100644
--- a/contributions/scientific-paper/week4/oarbman-dsami.md
+++ b/contributions/scientific-paper/week4/oarbman-dsami.md
@@ -4,21 +4,21 @@
 An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)
 
 ## Names and KTH ID
-- Oscar Arbman (oarbman@kth.se)
+- Oscar Arbman (oarbman@kth.se)
 - Dania Sami (dsami@kth.se)
 
 ## Deadline
 - Week 4
 
-
 ## Category
 - Scientific paper
 
 ## Description
-This paper presents an empirical study on GPU underutilization in deep learning training jobs, a critical challenge for both MLOps and large-scale ML systems. The authors analyze more than 400 real-world jobs, of which nearly 50% exhibited GPU utilization ≤50%. They uncover 706 issues related to data pipelines, model configurations, and resource scheduling.
+This paper presents an empirical study on GPU underutilization in deep learning training jobs, a critical challenge for both MLOps and large-scale ML systems. The authors analyze more than 400 real-world jobs, of which nearly 50% exhibited GPU utilization ≤50%. They uncover 706 issues related to data pipelines, model configurations, and resource scheduling.
 
-The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to **7.5× performance improvements**. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient.
+The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to **7.5× performance improvements**. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient.
 
-**Link to article:**
-[An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)](https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/)
+## Relevance
+This paper is highly relevant to MLOps/AIOps/LLMOps, since efficient GPU usage is central to deploying ML systems in production. Low GPU utilization directly affects training time, operational cost, and scalability of ML workloads.
 
+By analyzing real-world operational challenges and offering a framework for diagnosing and fixing inefficiencies, this work provides practical insights for DevOps engineers and ML practitioners. It connects directly to the course theme of DevOps for ML systems, emphasizing resource management, pipeline reliability, and system-level optimization.

From 1dcd888ed88182bebf31ab8c3b5e6aa964632ebc Mon Sep 17 00:00:00 2001
From: dania-sami <147648301+dania-sami@users.noreply.github.com>
Date: Tue, 9 Sep 2025 16:07:26 +0200
Subject: [PATCH 06/16] Update oarbman-dsami.md

---
 contributions/scientific-paper/week4/oarbman-dsami.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/contributions/scientific-paper/week4/oarbman-dsami.md b/contributions/scientific-paper/week4/oarbman-dsami.md
index b0db510a40..f4f5142930 100644
--- a/contributions/scientific-paper/week4/oarbman-dsami.md
+++ b/contributions/scientific-paper/week4/oarbman-dsami.md
@@ -18,7 +18,8 @@ This paper presents an empirical study on GPU underutilization in deep learning
 
 The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to **7.5× performance improvements**. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient.
 
-## Relevance
+Relevance
+
 This paper is highly relevant to MLOps/AIOps/LLMOps, since efficient GPU usage is central to deploying ML systems in production. Low GPU utilization directly affects training time, operational cost, and scalability of ML workloads.
 
 By analyzing real-world operational challenges and offering a framework for diagnosing and fixing inefficiencies, this work provides practical insights for DevOps engineers and ML practitioners. It connects directly to the course theme of DevOps for ML systems, emphasizing resource management, pipeline reliability, and system-level optimization.

From f16ba855e7a1ed184b778063e8a294fa57059a43 Mon Sep 17 00:00:00 2001
From: dania-sami <147648301+dania-sami@users.noreply.github.com>
Date: Tue, 9 Sep 2025 16:08:04 +0200
Subject: [PATCH 07/16] Update README.md

---
 contributions/scientific-paper/week4/oarbman-dsami/README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/contributions/scientific-paper/week4/oarbman-dsami/README.md b/contributions/scientific-paper/week4/oarbman-dsami/README.md
index ca841e2ecd..6d833703a8 100644
--- a/contributions/scientific-paper/week4/oarbman-dsami/README.md
+++ b/contributions/scientific-paper/week4/oarbman-dsami/README.md
@@ -24,7 +24,8 @@ The study shows that relatively simple fixes such as improving I/O pipelines or
 **Link to article:**
 [An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)](https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/)
 
-## Relevance
+ Relevance
+
 This paper is highly relevant to MLOps/AIOps/LLMOps, since efficient GPU usage is central to deploying ML systems in production. Low GPU utilization directly affects training time, operational cost, and scalability of ML workloads.
 
 By analyzing real-world operational challenges and offering a framework for diagnosing and fixing inefficiencies, this work provides practical insights for DevOps engineers and ML practitioners. It connects directly to the course theme of DevOps for ML systems, emphasizing resource management, pipeline reliability, and system-level optimization.

From aea75f574f97048f0946cf189f83b91b945071c1 Mon Sep 17 00:00:00 2001
From: dania-sami <147648301+dania-sami@users.noreply.github.com>
Date: Tue, 9 Sep 2025 16:22:20 +0200
Subject: [PATCH 08/16] Update README.md

---
 .../week4/oarbman-dsami/README.md | 39 ++++++++-----------
 1 file changed, 16 insertions(+), 23 deletions(-)

diff --git a/contributions/scientific-paper/week4/oarbman-dsami/README.md b/contributions/scientific-paper/week4/oarbman-dsami/README.md
index 6d833703a8..1cbc0a0fb3 100644
--- a/contributions/scientific-paper/week4/oarbman-dsami/README.md
+++ b/contributions/scientific-paper/week4/oarbman-dsami/README.md
@@ -1,31 +1,24 @@
-# Assignment Proposal
+Assignment Proposal
+Title
+An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)
 
-## Title
-An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)
+Names and KTH ID
+Oscar Arbman (oarbman@kth.se)
+Dania Sami (dsami@kth.se)
 
-## Names and KTH ID
-- Oscar Arbman (oarbman@kth.se)
-- Dania Sami (dsami@kth.se)
+Deadline
+Week 4
 
-## Deadline
-- Week 4
+Category
+Scientific paper
 
-## Topic
-- MLOps / AIOps / LLMOps
-
-## Category
-- Scientific paper
-
-## Description
+Description
 This paper presents an empirical study on GPU underutilization in deep learning training jobs, a critical challenge for both MLOps and large-scale ML systems. The authors analyze more than 400 real-world jobs, of which nearly 50% exhibited GPU utilization ≤50%. They uncover 706 issues related to data pipelines, model configurations, and resource scheduling.
 
-The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to **7.5× performance improvements**. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient.
-
-**Link to article:**
-[An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)](https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/)
+The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to 7.5× performance improvements. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient.
 
- Relevance
-
-This paper is highly relevant to MLOps/AIOps/LLMOps, since efficient GPU usage is central to deploying ML systems in production. Low GPU utilization directly affects training time, operational cost, and scalability of ML workloads.
+The full paper can be accessed here:
+https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/
 
-By analyzing real-world operational challenges and offering a framework for diagnosing and fixing inefficiencies, this work provides practical insights for DevOps engineers and ML practitioners. It connects directly to the course theme of DevOps for ML systems, emphasizing resource management, pipeline reliability, and system-level optimization.
+Relevance
+This paper is highly relevant to MLOps/AIOps/LLMOps, since efficient GPU usage is central to deploying ML systems in production. Low GPU utilization directly affects training time, operational cost, and scalability of ML workloads. By analyzing real-world operational challenges and offering a framework for diagnosing and fixing inefficiencies, this work provides practical insights for DevOps engineers and ML practitioners. It connects directly to the course theme of DevOps for ML systems, emphasizing resource management, pipeline reliability, and system-level optimization.

From 95ce04339f3cf5f28609f3402b49aa70025147eb Mon Sep 17 00:00:00 2001
From: dania-sami <147648301+dania-sami@users.noreply.github.com>
Date: Tue, 9 Sep 2025 16:22:40 +0200
Subject: [PATCH 09/16] Update oarbman-dsami.md

---
 .../scientific-paper/week4/oarbman-dsami.md | 35 +++++++++----------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/contributions/scientific-paper/week4/oarbman-dsami.md b/contributions/scientific-paper/week4/oarbman-dsami.md
index f4f5142930..1cbc0a0fb3 100644
--- a/contributions/scientific-paper/week4/oarbman-dsami.md
+++ b/contributions/scientific-paper/week4/oarbman-dsami.md
@@ -1,25 +1,24 @@
-# Assignment Proposal
+Assignment Proposal
+Title
+An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)
 
-## Title
-An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)
+Names and KTH ID
+Oscar Arbman (oarbman@kth.se)
+Dania Sami (dsami@kth.se)
 
-## Names and KTH ID
-- Oscar Arbman (oarbman@kth.se)
-- Dania Sami (dsami@kth.se)
+Deadline
+Week 4
 
-## Deadline
-- Week 4
+Category
+Scientific paper
 
-## Category
-- Scientific paper
+Description
+This paper presents an empirical study on GPU underutilization in deep learning training jobs, a critical challenge for both MLOps and large-scale ML systems. The authors analyze more than 400 real-world jobs, of which nearly 50% exhibited GPU utilization ≤50%. They uncover 706 issues related to data pipelines, model configurations, and resource scheduling.
 
-## Description
-This paper presents an empirical study on GPU underutilization in deep learning training jobs, a critical challenge for both MLOps and large-scale ML systems. The authors analyze more than 400 real-world jobs, of which nearly 50% exhibited GPU utilization ≤50%. They uncover 706 issues related to data pipelines, model configurations, and resource scheduling.
+The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to 7.5× performance improvements. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient.
 
-The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to **7.5× performance improvements**. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient.
+The full paper can be accessed here:
+https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/
 
-Relevance
-
-This paper is highly relevant to MLOps/AIOps/LLMOps, since efficient GPU usage is central to deploying ML systems in production. Low GPU utilization directly affects training time, operational cost, and scalability of ML workloads.
-
-By analyzing real-world operational challenges and offering a framework for diagnosing and fixing inefficiencies, this work provides practical insights for DevOps engineers and ML practitioners. It connects directly to the course theme of DevOps for ML systems, emphasizing resource management, pipeline reliability, and system-level optimization.
+Relevance
+This paper is highly relevant to MLOps/AIOps/LLMOps, since efficient GPU usage is central to deploying ML systems in production. Low GPU utilization directly affects training time, operational cost, and scalability of ML workloads. By analyzing real-world operational challenges and offering a framework for diagnosing and fixing inefficiencies, this work provides practical insights for DevOps engineers and ML practitioners. It connects directly to the course theme of DevOps for ML systems, emphasizing resource management, pipeline reliability, and system-level optimization.

From 9930a97167f830692eb84e926d560926b907ccc8 Mon Sep 17 00:00:00 2001
From: dania-sami <147648301+dania-sami@users.noreply.github.com>
Date: Tue, 9 Sep 2025 16:31:03 +0200
Subject: [PATCH 10/16] Update oarbman-dsami.md

---
 .../scientific-paper/week4/oarbman-dsami.md | 28 ++++++++++---------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/contributions/scientific-paper/week4/oarbman-dsami.md b/contributions/scientific-paper/week4/oarbman-dsami.md
index 1cbc0a0fb3..8a62f461e6 100644
--- a/contributions/scientific-paper/week4/oarbman-dsami.md
+++ b/contributions/scientific-paper/week4/oarbman-dsami.md
@@ -1,24 +1,26 @@
-Assignment Proposal
-Title
-An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)
+# Assignment Proposal
 
-Names and KTH ID
-Oscar Arbman (oarbman@kth.se)
-Dania Sami (dsami@kth.se)
+## Title
+An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)
 
-Deadline
-Week 4
+## Names and KTH ID
+- Oscar Arbman (oarbman@kth.se)
+- Dania Sami (dsami@kth.se)
 
-Category
-Scientific paper
+## Deadline
+- Week 4
 
-Description
+## Category
+- Scientific paper
+
+## Description
 This paper presents an empirical study on GPU underutilization in deep learning training jobs, a critical challenge for both MLOps and large-scale ML systems. The authors analyze more than 400 real-world jobs, of which nearly 50% exhibited GPU utilization ≤50%. They uncover 706 issues related to data pipelines, model configurations, and resource scheduling.
 
-The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to 7.5× performance improvements. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient.
+The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to **7.5× performance improvements**. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient.
 
 The full paper can be accessed here:
 https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/
 
-Relevance
+Relevance
+
 This paper is highly relevant to MLOps/AIOps/LLMOps, since efficient GPU usage is central to deploying ML systems in production. Low GPU utilization directly affects training time, operational cost, and scalability of ML workloads. By analyzing real-world operational challenges and offering a framework for diagnosing and fixing inefficiencies, this work provides practical insights for DevOps engineers and ML practitioners. It connects directly to the course theme of DevOps for ML systems, emphasizing resource management, pipeline reliability, and system-level optimization.

From 194ef180899850bd405966c5b248a04409749aa0 Mon Sep 17 00:00:00 2001
From: dania-sami <147648301+dania-sami@users.noreply.github.com>
Date: Tue, 9 Sep 2025 16:31:33 +0200
Subject: [PATCH 11/16] Update README.md

---
 .../week4/oarbman-dsami/README.md | 28 ++++++++++---------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/contributions/scientific-paper/week4/oarbman-dsami/README.md b/contributions/scientific-paper/week4/oarbman-dsami/README.md
index 1cbc0a0fb3..8a62f461e6 100644
--- a/contributions/scientific-paper/week4/oarbman-dsami/README.md
+++ b/contributions/scientific-paper/week4/oarbman-dsami/README.md
@@ -1,24 +1,26 @@
-Assignment Proposal
-Title
-An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)
+# Assignment Proposal
 
-Names and KTH ID
-Oscar Arbman (oarbman@kth.se)
-Dania Sami (dsami@kth.se)
+## Title
+An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)
 
-Deadline
-Week 4
+## Names and KTH ID
+- Oscar Arbman (oarbman@kth.se)
+- Dania Sami (dsami@kth.se)
 
-Category
-Scientific paper
+## Deadline
+- Week 4
 
-Description
+## Category
+- Scientific paper
+
+## Description
 This paper presents an empirical study on GPU underutilization in deep learning training jobs, a critical challenge for both MLOps and large-scale ML systems. The authors analyze more than 400 real-world jobs, of which nearly 50% exhibited GPU utilization ≤50%. They uncover 706 issues related to data pipelines, model configurations, and resource scheduling.
 
-The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to 7.5× performance improvements. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient.
+The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to **7.5× performance improvements**. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient.
 
 The full paper can be accessed here:
 https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/
 
-Relevance
+Relevance
+
 This paper is highly relevant to MLOps/AIOps/LLMOps, since efficient GPU usage is central to deploying ML systems in production. Low GPU utilization directly affects training time, operational cost, and scalability of ML workloads. By analyzing real-world operational challenges and offering a framework for diagnosing and fixing inefficiencies, this work provides practical insights for DevOps engineers and ML practitioners. It connects directly to the course theme of DevOps for ML systems, emphasizing resource management, pipeline reliability, and system-level optimization.

From 9a2dc33d23923a78fc2775ad4ef938ded5a827af Mon Sep 17 00:00:00 2001
From: dania-sami <147648301+dania-sami@users.noreply.github.com>
Date: Tue, 9 Sep 2025 16:36:46 +0200
Subject: [PATCH 12/16] Update README.md

---
 .../week4/oarbman-dsami/README.md | 30 +++++++++----------
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/contributions/scientific-paper/week4/oarbman-dsami/README.md b/contributions/scientific-paper/week4/oarbman-dsami/README.md
index 8a62f461e6..48e0cc0ba0 100644
--- a/contributions/scientific-paper/week4/oarbman-dsami/README.md
+++ b/contributions/scientific-paper/week4/oarbman-dsami/README.md
@@ -1,26 +1,24 @@
-# Assignment Proposal
-
-## Title
+Assignment Proposal
+Title
 An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)
 
-## Names and KTH ID
-- Oscar Arbman (oarbman@kth.se)
-- Dania Sami (dsami@kth.se)
+Names and KTH ID
+Oscar Arbman (oarbman@kth.se)
+Dania Sami (dsami@kth.se)
 
-## Deadline
-- Week 4
+Deadline
+Week 4
 
-## Category
-- Scientific paper
+Category
+Scientific paper
 
-## Description
-This paper presents an empirical study on GPU underutilization in deep learning training jobs, a critical challenge for both MLOps and large-scale ML systems. The authors analyze more than 400 real-world jobs, of which nearly 50% exhibited GPU utilization ≤50%. They uncover 706 issues related to data pipelines, model configurations, and resource scheduling.
+Description
+This paper presents an empirical study on GPU underutilization in deep learning training jobs, a critical challenge for both MLOps and large-scale ML systems. The authors analyze more than 400 real-world jobs, of which nearly 50% exhibited GPU utilization ≤50%. They uncover 706 issues related to data pipelines, model configurations, and resource scheduling.
 
-The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to **7.5× performance improvements**. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient.
+The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to 7.5× performance improvements. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient.
 
-The full paper can be accessed here:
-https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/
+The full paper can be accessed here:
+https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/
 
 Relevance
-
 This paper is highly relevant to MLOps/AIOps/LLMOps, since efficient GPU usage is central to deploying ML systems in production. Low GPU utilization directly affects training time, operational cost, and scalability of ML workloads. By analyzing real-world operational challenges and offering a framework for diagnosing and fixing inefficiencies, this work provides practical insights for DevOps engineers and ML practitioners. It connects directly to the course theme of DevOps for ML systems, emphasizing resource management, pipeline reliability, and system-level optimization.

From f9ea183fefe19dea412815f5416a60b94a5ca924 Mon Sep 17 00:00:00 2001
From: dania-sami <147648301+dania-sami@users.noreply.github.com>
Date: Tue, 9 Sep 2025 16:49:11 +0200
Subject: [PATCH 13/16] Update README.md

---
 .../week4/oarbman-dsami/README.md | 39 ++++++++++++-------
 1 file changed, 24 insertions(+), 15 deletions(-)

diff --git a/contributions/scientific-paper/week4/oarbman-dsami/README.md b/contributions/scientific-paper/week4/oarbman-dsami/README.md
index 48e0cc0ba0..6798ae8fb1 100644
--- a/contributions/scientific-paper/week4/oarbman-dsami/README.md
+++ b/contributions/scientific-paper/week4/oarbman-dsami/README.md
@@ -1,24 +1,33 @@
-Assignment Proposal
-Title
+# Assignment Proposal
+
+## Title
+
 An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024)
 
-Names and KTH ID
-Oscar Arbman (oarbman@kth.se)
-Dania Sami (dsami@kth.se)
+## Names and KTH ID
+
+- Oscar Arbman (oarbman@kth.se)
+
+- Dania Sami (dsami@kth.se)
+
+## Deadline
+
+- Week 4
+
+- Topic: MLOps/AIOps/LLMOps
+
+## Category
+
+- Scientific paper
 
-Deadline
-Week 4
+## Description
 
-Category
-Scientific paper
+This paper presents an empirical study on GPU underutilization in deep learning training jobs, a critical challenge for both MLOps and large-scale ML systems. The authors analyze more than 400 real-world jobs, of which nearly 50% exhibited GPU utilization ≤50%. They uncover 706 issues related to data pipelines, model configurations, and resource scheduling.
 
-Description
-This paper presents an empirical study on GPU underutilization in deep learning training jobs, a critical challenge for both MLOps and large-scale ML systems. The authors analyze more than 400 real-world jobs, of which nearly 50% exhibited GPU utilization ≤50%. They uncover 706 issues related to data pipelines, model configurations, and resource scheduling.
+The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to **7.5× performance improvements**. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient. -The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to 7.5× performance improvements. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient. +- Link to article: https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/ -The full paper can be accessed here: -https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/ +**Relevance** -Relevance This paper is highly relevant to MLOps/AIOps/LLMOps, since efficient GPU usage is central to deploying ML systems in production. Low GPU utilization directly affects training time, operational cost, and scalability of ML workloads. By analyzing real-world operational challenges and offering a framework for diagnosing and fixing inefficiencies, this work provides practical insights for DevOps engineers and ML practitioners. It connects directly to the course theme of DevOps for ML systems, emphasizing resource management, pipeline reliability, and system-level optimization. 
From 1495f08fe348aae319a33e15dbdbee86360ae6f9 Mon Sep 17 00:00:00 2001 From: dania-sami <147648301+dania-sami@users.noreply.github.com> Date: Tue, 9 Sep 2025 16:49:53 +0200 Subject: [PATCH 14/16] Update oarbman-dsami.md --- .../scientific-paper/week4/oarbman-dsami.md | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/contributions/scientific-paper/week4/oarbman-dsami.md b/contributions/scientific-paper/week4/oarbman-dsami.md index 8a62f461e6..6798ae8fb1 100644 --- a/contributions/scientific-paper/week4/oarbman-dsami.md +++ b/contributions/scientific-paper/week4/oarbman-dsami.md @@ -1,26 +1,33 @@ # Assignment Proposal ## Title + An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024) ## Names and KTH ID -- Oscar Arbman (oarbman@kth.se) + +- Oscar Arbman (oarbman@kth.se) + - Dania Sami (dsami@kth.se) ## Deadline + - Week 4 +- Topic: MLOps/AIOps/LLMOps + ## Category + - Scientific paper ## Description + This paper presents an empirical study on GPU underutilization in deep learning training jobs, a critical challenge for both MLOps and large-scale ML systems. The authors analyze more than 400 real-world jobs, of which nearly 50% exhibited GPU utilization ≤50%. They uncover 706 issues related to data pipelines, model configurations, and resource scheduling. -The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to **7.5× performance improvements**. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient. +The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to **7.5× performance improvements**. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient. 
-The full paper can be accessed here: -https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/ +- Link to article: https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/ -Relevance +**Relevance** This paper is highly relevant to MLOps/AIOps/LLMOps, since efficient GPU usage is central to deploying ML systems in production. Low GPU utilization directly affects training time, operational cost, and scalability of ML workloads. By analyzing real-world operational challenges and offering a framework for diagnosing and fixing inefficiencies, this work provides practical insights for DevOps engineers and ML practitioners. It connects directly to the course theme of DevOps for ML systems, emphasizing resource management, pipeline reliability, and system-level optimization. From d878421bdee4585e5648cb46fc14fe6d603a7d93 Mon Sep 17 00:00:00 2001 From: dania-sami <147648301+dania-sami@users.noreply.github.com> Date: Tue, 9 Sep 2025 18:31:30 +0200 Subject: [PATCH 15/16] Delete contributions/scientific-paper/week4/oarbman-dsami.md --- .../scientific-paper/week4/oarbman-dsami.md | 33 ------------------- 1 file changed, 33 deletions(-) delete mode 100644 contributions/scientific-paper/week4/oarbman-dsami.md diff --git a/contributions/scientific-paper/week4/oarbman-dsami.md b/contributions/scientific-paper/week4/oarbman-dsami.md deleted file mode 100644 index 6798ae8fb1..0000000000 --- a/contributions/scientific-paper/week4/oarbman-dsami.md +++ /dev/null @@ -1,33 +0,0 @@ -# Assignment Proposal - -## Title - -An Empirical Study on Low GPU Utilization of Deep Learning Jobs (ICSE 2024) - -## Names and KTH ID - -- Oscar Arbman (oarbman@kth.se) - -- Dania Sami (dsami@kth.se) - -## Deadline - -- Week 4 - -- Topic: MLOps/AIOps/LLMOps - -## Category - -- Scientific paper - -## Description - -This paper presents an empirical study on GPU underutilization in deep 
learning training jobs, a critical challenge for both MLOps and large-scale ML systems. The authors analyze more than 400 real-world jobs, of which nearly 50% exhibited GPU utilization ≤50%. They uncover 706 issues related to data pipelines, model configurations, and resource scheduling. - -The study shows that relatively simple fixes such as improving I/O pipelines or adjusting training configurations can yield up to **7.5× performance improvements**. This work provides actionable insights into how inefficiencies arise and how they can be mitigated to make ML pipelines more reliable and cost-efficient. - -- Link to article: https://www.microsoft.com/en-us/research/publication/an-empirical-study-on-low-gpu-utilization-of-deep-learning-jobs/ - -**Relevance** - -This paper is highly relevant to MLOps/AIOps/LLMOps, since efficient GPU usage is central to deploying ML systems in production. Low GPU utilization directly affects training time, operational cost, and scalability of ML workloads. By analyzing real-world operational challenges and offering a framework for diagnosing and fixing inefficiencies, this work provides practical insights for DevOps engineers and ML practitioners. It connects directly to the course theme of DevOps for ML systems, emphasizing resource management, pipeline reliability, and system-level optimization. 
From 5b73f35e28eb4f7479ba1b0fab3a158a7b4a456a Mon Sep 17 00:00:00 2001
From: Dania Sami
Date: Wed, 8 Oct 2025 14:49:36 +0200
Subject: [PATCH 16/16] Added open source proposal for Docker Docs contribution

---
 .../open-source/dania-sami/README.md | 22 +++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100644 contributions/open-source/dania-sami/README.md

diff --git a/contributions/open-source/dania-sami/README.md b/contributions/open-source/dania-sami/README.md
new file mode 100644
index 0000000000..c45e79641b
--- /dev/null
+++ b/contributions/open-source/dania-sami/README.md
@@ -0,0 +1,22 @@
+# Assignment Proposal
+
+## Title
+Adding an Example for Running Multiple Services with Docker Compose
+
+## Name and KTH ID
+- Dania Sami (dsami@kth.se)
+
+## Deadline
+- Task 3
+
+## Category
+- Open Source
+
+## Description
+I plan to contribute to the Docker open-source project by improving its documentation. Specifically, I will add a new example that demonstrates how to run multiple services (such as a frontend and backend) using a single `docker-compose.yml` file.
+
+This addition will make it easier for new developers to understand multi-service setups, a key concept in modern DevOps workflows. It involves editing the `content/compose/gettingstarted.md` file in the Docker Docs repository and submitting a pull request to merge the update.
+
+## Relevance
+This contribution is relevant to DevOps because Docker Compose is widely used for managing multi-container applications in CI/CD pipelines and local development environments. The example improves documentation clarity, reproducibility, and developer experience, supporting better automation practices in DevOps.
+