Hello Marios,
Regarding Zhiheng's concern (thanks for pointing this out!), my understanding is that large models can still be split across multiple GPUs without NVLink. While NVLink offers better efficiency in certain cases, it is not strictly required for multi-GPU training. For more details, please refer to https://huggingface.co/docs/transformers/perf_train_gpu_many.
In short, NVLink is a dedicated GPU-to-GPU communication channel that supplements PCIe within a single physical server. I therefore believe it should not be a critical factor when selecting GPU models.
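To make the "split across multiple GPUs" point concrete, here is a minimal sketch of layer-wise placement, assuming a model made of equally sized layers. The function name `assign_layers` and all sizes are hypothetical and chosen for illustration only; this is not taken from the linked guide or from any library.

```python
from typing import List


def assign_layers(layer_gb: List[float], usable_gb: float, n_gpus: int) -> List[List[int]]:
    """Place consecutive layers on GPUs, moving to the next GPU when one is full."""
    placement: List[List[int]] = [[] for _ in range(n_gpus)]
    gpu, used = 0, 0.0
    for idx, size in enumerate(layer_gb):
        if size > usable_gb:
            raise ValueError(f"layer {idx} alone exceeds one GPU's memory")
        if used + size > usable_gb:
            # Current GPU is full: continue on the next one.
            gpu, used = gpu + 1, 0.0
            if gpu >= n_gpus:
                raise ValueError("model does not fit on the available GPUs")
        placement[gpu].append(idx)
        used += size
    return placement


# Hypothetical numbers: 40 layers of 2 GB each (80 GB in total),
# two cards with 44 GB of usable memory each.
placement = assign_layers([2.0] * 40, usable_gb=44.0, n_gpus=2)
print([len(p) for p in placement])  # prints [22, 18]
```

In this sketch only the activations at the boundary between the two groups of layers (and their gradients during training) have to cross from one card to the other. That transfer can go over PCIe or NVLink, whichever is present; a faster link shortens it but does not decide whether the split is possible.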
Best regards, Cyril
________________________________
From: José Zerna Torres <j.e.zernatorres@uva.nl>
Sent: Saturday, December 21, 2024 12:07:30 AM
To: Marios Avgeris <m.avgeris@uva.nl>
Cc: Zhiheng Yang <z.yang@uva.nl>; prognets@list.uva.nl
Subject: [Prognets] Re: Purchase of New Graphics Cards
Hi Marios.
I agree with Zhiheng: they are all good GPUs. I have not worked with these; however, after checking the specs for the options you shared with us, I have the following comments:
* NVIDIA H100 Tensor Core GPU: I believe this is the most powerful option and offers the highest performance of the three. It is also built on NVIDIA's latest Hopper architecture, which makes it a little more "future-proof", so it could be a good long-term investment for our team. However, we might need to evaluate our infrastructure capabilities (and potential adaptations, if needed), as it seems to require robust cooling and power systems. Here's a reference regarding the cooling requirements: https://massedcompute.com/faq-answers/?question=What%20are%20the%20cooling%2...
* NVIDIA A40: It seems to offer a good balance of compute and memory for AI workloads, so it could be a great middle-ground option. It also seems to be very efficient with shared workloads and to have great virtualization support.
* NVIDIA RTX 6000 Ada Generation: If I am not mistaken, this one is primarily designed for desktop workstations rather than data center deployment. It has lower performance than the H100 and is likely to be limited for large-scale distributed training.
Given the research requirements of our team, I would recommend either the H100 (if budget and infrastructure allow) or the A40 as an alternative.
Best regards,
José
________________________________
From: Zhiheng Yang <z.yang@uva.nl>
Sent: Friday, 20 December 2024 15:59
To: Marios Avgeris <m.avgeris@uva.nl>; prognets@list.uva.nl
Subject: [Prognets] Re: Purchase of New Graphics Cards
Hi Marios,
Thanks for sharing, it is great to hear that we are going to have more GPUs!
I think they are all good GPUs. I have used some older versions like A100 and A6000 (not Ada) but not these new versions.
Just a quick note regarding the NVIDIA RTX 6000 Ada, in case it is helpful: it doesn't support NVLink, meaning GPU memory cannot be pooled across multiple cards. This could limit its ability to handle large model training or inference, so its advantage in multi-GPU setups might be less significant. (Btw, it is really weird that the new 6000 Ada dropped NVLink support…)
Best,
Zhiheng
From: Marios Avgeris <m.avgeris@uva.nl>
Date: Friday, December 20, 2024 at 14:22
To: prognets@list.uva.nl
Subject: [Prognets] Purchase of New Graphics Cards
Hi team,
I hope you are all doing well and preparing for your winter break.
One last request before we say goodbye for a while: we are planning to order some graphics cards of the following types:
1. NVIDIA RTX 6000 Ada https://www.nvidia.com/en-us/design-visualization/rtx-6000/
2. NVIDIA A40 https://www.nvidia.com/en-us/data-center/a40/
3. NVIDIA H100 Tensor Core GPU https://www.nvidia.com/en-us/data-center/h100/
Has anyone worked with any of them (or similar)? If so, do you have any feedback?
Do you have any objections/other suggestions?
We would greatly appreciate any comments before the end of the day, as we are planning to finalize the equipment list before we leave.
Best,
Marios