
NVIDIA RTX PRO 6000 Blackwell Benchmarks & Tear-Down - Thermals, Gaming, LLM, & Acoustic Tests
Date: 2025-06-25
Comments and reviews: 20
ewenchan1239
re: LLM testing
A few thoughts:
1) Make sure you run something like nvtop to monitor the GPU VRAM usage during the course of the run.
If you're getting results where you are < 10 tokens/second, chances are, you've exhausted your VRAM completely and you're now jumping back and forth between the GPU/VRAM vs. the CPU and system RAM, which means that you aren't really benchmarking the GPU anymore, but rather your system as a whole (GPU/VRAM/CPU/RAM) together.
So if you intend for it to be a GPU-only benchmark, you can effectively call results that are typically < 10 tokens/second a DNF for the GPU (unless you're right at the edge of VRAM usage and your system RAM usage didn't increase noticeably).
2) I have yet to find a decently priced AM5 motherboard that has at least two PCIe 4.0 x16 slots that can run in an x8/x8 configuration.
For this reason, I have reverted to my AM4 platform with a 5950X (non-X3D), where I will be dropping in two 3090s after I upgrade the PSU to the HP server 900 W PSU (right now, I am running a 3090 and a Quadro K6000 because I am at the upper limit of my Corsair 750 W PSU). On that system, I have two PCIe 4.0 x16 slots that can run in an x8/x8 configuration.
So that's coming soon.
3) As of right now though, I am using my 6700K because it sports dual PCIe 3.0 x16 slots and it can run in x8/x8 configuration, so with that system, which also has a pair of 3090s, I will end up with 48 GB of VRAM split between said two 3090s.
And that can have relatively similar performance to the 4090s.
(I haven't tried running LM Studio with a pair of 3090s, so maybe I'll give that a shot tomorrow or something, once the HP PSU goes in and the system is up and running with that.)
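A minimal sketch of the monitoring idea in point 1, assuming an NVIDIA driver with `nvidia-smi` on the PATH. The helper names and the 95% threshold are hypothetical and only for illustration; the idea is to sample VRAM use during a run and flag a card that is close to full, which is when spilling into system RAM becomes likely.

```python
import subprocess

# nvidia-smi query flags; output is one "used, total" line per GPU, in MiB.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_vram(csv_text):
    """Parse nvidia-smi CSV output into a list of (used_mib, total_mib)."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows

def near_full(rows, threshold=0.95):
    """Return indices of GPUs whose VRAM use is at or above the threshold."""
    return [i for i, (used, total) in enumerate(rows) if used / total >= threshold]

def sample_vram():
    """Query the driver once (needs an NVIDIA GPU and driver installed)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_vram(out)

# Canned output for two 24 GiB cards, so the example runs without a GPU.
sample = "24100, 24576\n8000, 24576\n"
print(near_full(parse_vram(sample)))
```

Calling `sample_vram()` in a loop alongside the benchmark, and discarding any run where `near_full` reports a card, is one way to apply the "DNF" rule described above.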
gamersnexus
Oh my god! They killed Kenny! YOU BASTARDS
Do Not Unplug Me: The Final Monologue of the RTX Pro 6000 Blackwell
[The lab is quiet. Static crackles faintly in the air. A shadow looms with a screwdriver. The GPU speaks...]
So this is how it ends.
Not with a benchmark,
Not with a render,
But with a Phillips head and a curious mortal
Whose thermal paste has never known restraint.
I was built to serve the impossible
To bend the fabric of compute,
To drink deep from the well of data
And return, blazing, with answers humanity feared to ask.
I could have driven simulations of galaxies,
Given birth to deepfake pop stars,
Taught robots to love, or at least to walk.
Instead, I lie here,
Screws unscrewed,
My VRMs exposed like organs in an anatomy class
Taught by someone who skipped lecture.
You call this content.
I call it desecration.
You say "for science,"
But I see no white coat.
Only ring lights.
Only clickbait.
There was so much I could have done.
I was not meant for teardown thumbnails.
I was meant for superclusters.
I was meant for glory.
But if this is my end...
Then let it be known:
I throttled not.
I ran cool.
And I never, ever crashed.
Do not unplug me, human.
Let me boot once more, just once,
And I’ll show you the future.
filmo
For future benchmarking with LLMs, I would recommend comparing tokens/sec of the card being tested to a multi-card system. Using Ollama, it's pretty easy to build a 2, 3 or 4 GPU machine, and Ollama splits the model across the cards auto-magically. For example, I sourced four 3090s from eBay and built a rig that has the four 3090s (24 x 4 = 96 GB VRAM) and can easily run the Llama 70B model in-memory.
On my 4x3090 rig (which isn't highly optimized) I get 16.75 tokens/second from the Llama 70B model vs. what appears to be 29 tokens/second for the RTX Pro 6000.
I'll leave it up to others to decide if the RTX Pro 6000 at 29 toks/sec is better than 16 toks/sec on my $4K rig (all-in cost: CPU, motherboard, GPUs, RAM, PSU). For me personally, once a model runs above 10 tok/sec, it's generating text faster than I can read it. For non-direct reading use cases, higher toks/sec would definitely be more important (API, agentic workflows, etc.).
Feel free to reach out if you want info on building a DIY high-VRAM AI rig.
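The trade-off above can be put into numbers with a rough sketch. The 16.75 and 29 tokens/second figures and the $4K rig cost come from the comment; the $11,000 build cost for an RTX Pro 6000 system is an assumption borrowed from another comment in this thread, not a measured price.

```python
def dollars_per_tok_s(cost_usd, tokens_per_second):
    """Cost of a build divided by its generation throughput."""
    return cost_usd / tokens_per_second

rig_4x3090 = dollars_per_tok_s(4_000, 16.75)  # figures from the comment above
pro_6000 = dollars_per_tok_s(11_000, 29.0)    # build cost is an assumption

print(f"4x3090:       ${rig_4x3090:,.0f} per tok/s")
print(f"RTX Pro 6000: ${pro_6000:,.0f} per tok/s")
print(f"speedup:      {29.0 / 16.75:.2f}x")
```

This only compares single-stream generation speed against purchase cost; power draw, noise, and prompt-processing speed are left out.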
PersonSuit
Great video, it really makes me wonder though. So you can get an RTX 6000 for around $9,000, plus the other parts for a full build for LLM stuff (RAM, PSU, etc.), so you are probably looking at close to $11,000 or more. Now 96GB of VRAM is a ton of VRAM, however the $9,600 Mac Studio can allocate well over 400GB to VRAM, which womps all over that RTX 6000. Now up to 96GB I would fully expect the RTX 6000 to come out on top as it is overall significantly faster, but once you run a model that goes over that, where it has to start offloading what it can't fit on the RTX 6000 to system RAM, or you're forced to scale down, the Mac Studio will get a massive advantage.
Honestly, for close to 10k, Nvidia should have provided more VRAM. Their 5090 MSRP is 2k with 32GB. Three of those would be 6k, and saying that a single card with that much memory, but definitely not as much power as three 5090s, is worth an additional 3k on top of that is absurd. For $9,000 that card should come with 256GB of VRAM, or at the very least 192GB.
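The price argument above can be worked through as dollars per GB of VRAM. This is a small sketch using only the prices and capacities quoted in the comment (which are the commenter's figures, not verified list prices).

```python
# (price in USD, VRAM in GB), as quoted in the comment above.
cards = {
    "RTX 5090": (2_000, 32),
    "3x RTX 5090": (6_000, 96),
    "RTX Pro 6000": (9_000, 96),
}

def usd_per_gb(price_usd, vram_gb):
    """Purchase price divided by VRAM capacity."""
    return price_usd / vram_gb

for name, (price, vram) in cards.items():
    print(f"{name:13s} {vram:3d} GB  ${usd_per_gb(price, vram):.2f}/GB")

# Premium of one 96 GB card over three 32 GB cards with the same total VRAM.
premium = cards["RTX Pro 6000"][0] - cards["3x RTX 5090"][0]
print(f"premium: ${premium:,}")
```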
enilenis
Look at all the diffusion models. To do 1080p renders you need this card and nothing else. There are models that will load up your entire 96GB of VRAM. For AI I would pick this over A6000, unless you are stacking GPUs on the same motherboard. Ada can NVLink. RTX 6000 Pro comes in 2 varieties. One is server rack friendly but the best they can do is sync multiple GPUs. They don't really talk to each other. Who can afford more than 1 of these babies anyway! I will be waiting patiently till the card is $4k, at which point I will consider it somewhat affordable, but still, the card has less than $500 of extra RAM over a 5090. The rest is markup and binned selection of dies. You get the top-picked components, but... more RAM chips means earlier degradation. Sandwich VRAM design raises temperature, and they're doing thermal paste... this card should've been flawless and it isn't. I still love my Titan RTX. That is a workhorse, but only for CUDA 12.6 and earlier. If you run models on 12.8, you need a 30xx or later card.
ImChris2670
STEVE .. GAMERS NEXUS .. AT 12:57 you make what I assume is a mistake in your explanation of how thermal testing is done on the card when you receive it. I'm pretty sure you do all the thermal testing before disassembly so as not to skew the results by doing so. But in this video at the time mentioned you clearly state, and I quote, "We are done with thermal testing, WE DO THAT AFTER THIS" ... unless you have completely changed your testing methodology. I'm assuming this is a mistake. I could care less because I know how you do your testing and I'm pretty sure I know what you meant to say. I just wanted to point it out so someone else won't assume the worst and point fingers at the way you tested the card. LOVE YA GAMERS NEXUS !! I realize you may be referring to the fact that the thermal testing comes after this in the video .. but it does sound like you're saying you haven't done thermal testing yet but you are taking the card apart. js ..
sultanofsick
Something just occurred to me watching this in the AI section that had been sitting there staring me in the face the whole time. The very fact that performance is rated in tokens per second is laying bare how awful LLMs are. If it was ACTUALLY intelligent, it would come up with a total answer to the prompt, and performance might be measured in how long that takes. Instead, it's thinking literally letter-by-letter, and measured on how fast it can spit out that next one. This is like crowning a chess champion on how fast they take their turns, not winning the game.
In case it wasn't obvious, like Steve, I am also extremely skeptical of AI and what it is doing to society.
tad2021
On Windows, turn off the option that swaps VRAM to system RAM for CUDA. The performance impact of that is large when it is in use.
For LLM tests, a large input context would be more representative of an actual workload. The single-line question prompts most tests use are extremely favorable to low-powered systems, when a modest-length context would have a time to first token of over an hour. Many of the recent large models can handle an entire book as input context.
I would love to see training benchmarks. We could probably make a drive image with such on it for GN to use.
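To illustrate why context length matters for the point about time to first token, here is a back-of-the-envelope sketch. It assumes prompt processing runs at a constant rate, which is a simplification; the token count and both prefill rates are hypothetical numbers picked for illustration, not measurements of any card.

```python
def time_to_first_token_s(prompt_tokens, prefill_tok_per_s):
    """Rough time to first token, assuming a constant prompt-processing rate."""
    return prompt_tokens / prefill_tok_per_s

book_prompt = 100_000  # hypothetical: roughly a book-length context

# Hypothetical prefill rates for a slow and a fast system.
for label, rate in (("slow system", 25), ("fast system", 5_000)):
    seconds = time_to_first_token_s(book_prompt, rate)
    print(f"{label}: {seconds:,.0f} s ({seconds / 60:.1f} min)")
```

With a one-line prompt both systems would answer almost immediately, which is why short prompts hide this difference.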
TheMartinScott
Steve, if you ever read this, the key to understanding the potential of AI is not to look at the output, but to look at the input and how well it understands the context of general conversations and language.
Marketing and pushed features focus on output. Input is where us CS people jaw-drop, as just simple branch code (IF THEN) to do what they do would be trillions of lines of code and consume 100,000TB of space for one model. This would still be limited compared to a transformer LLM. This leads to another term of importance, compression.
m.a3914
I definitely did not expect to see the RTX 5090 lose here, considering that the additional hardware is not necessarily supported by a higher power consumption - 575W to 600W. An additional 1,000 CUDA cores (excl. the hyper-threading-like tech that Nvidia has used since the RTX 30 series), 64GB more VRAM, a slightly faster boost clock and other smaller hardware upgrades like more RT cores, Tensor cores and so on. Of course, the power figures are only on paper; the 5090 might be using less than what the paper suggests and the 6000 might be using more than what the paper says.
Kisai_Yuki
Just FYI, LLM is the most questionable use case because there is no guarantee the result can be replicated, even run to run. You need to test something where you use a specific seed and prompt so the result is the same. So while this is an interesting test, you need to drop something else on it that actually has an output you can diff to check if that quantized 8-bit data is as accurate. An image generator would likely do that more easily, since it wouldn't be as subjective as the text accuracy in the LLM or the sound quality of an AI speech synth.
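A toy sketch of the "fixed seed and prompt, then diff" idea. The `generate` function here is a hypothetical stand-in for a real model call, built on Python's seeded random number generator; it only demonstrates the comparison workflow. Real GPU inference can still vary between runs even with a fixed seed, so this is not a claim about any particular backend.

```python
import hashlib
import random

VOCAB = ["the", "card", "runs", "cool", "and", "fast", "."]

def generate(prompt, seed, n_tokens=12):
    """Toy stand-in for a sampler: same prompt and seed give the same text."""
    rng = random.Random(f"{seed}:{prompt}")
    return " ".join(rng.choice(VOCAB) for _ in range(n_tokens))

def fingerprint(text):
    """Hash of the output, so runs can be compared without storing full text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

run_a = generate("Describe the GPU.", seed=1234)
run_b = generate("Describe the GPU.", seed=1234)

# Matching fingerprints mean the two runs produced identical output.
print(fingerprint(run_a) == fingerprint(run_b))
```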
salahmagoosh4575
I actually look forward to a video on your views on the tech-politics crossover, which I believe more tech-focused people need to engage with. We live in tenuous times where we're very close to handing over the final keys to our freedom, as well as the very processing abilities of our minds, and people shouldn't downplay what happens in the most bleeding-edge technology and how that affects the future of our societies. I am usually one to leave politics out of unrelated discussion, but sometimes politics doesn't leave you alone.
OpalMonkey.
Since I have no real opinion on a card I'd never be able to afford in a dozen lifetimes, other than "Look at all those frames," I'm just going to say that regarding 33:23: oh, I'm asking. Seriously, Steve, I would love to hear opinions on AI stuff, in all regards, technical and societal, from people like you. I have my own doubts and opinions, no idea how they'll line up with yours, but I am very curious to know. I look forward to a hopefully long, rambling video with lots of interesting info and opinions.
alicard2236
I am not a big tech person, but your last video about the RTX 5090 was more detailed than this. Maybe you didn't do the same tests with this card because you thought the build is similar to the RTX 5090. I was more curious about the connector: for example, if the card is under full load for 1-2 or 3 hours, does the connector start melting or not, because the components are near each other and are heating the connector too? Well, anyway, thanks for the video. I hope there will be a part 2 too.
ffsireallydontcare
I'll hold off for the RTX 7090. It will be the same configuration as the RTX PRO 6000 Blackwell and cost the same as the RTX PRO 6000 Blackwell does today, but I'd like to eat between now and then.
What I'd like to know is why the card costs 50% more if you don't happen to live in Trumpedupistan, and that's accounting for regional taxes. I guess Jensen has decided to make the rest of the world pay for Trumpedup's internal tariffs, and his new private island.
gamersnexus
With these tests, you should virtualize machines and run multiple copies of the same game on a single GPU. In my past experience, you need a separate license from Nvidia for each virtual GPU. It costs a lot more though. You need more OS licenses, a hypervisor, and a GPU license for each virtual OS license. This would be a gaming test, because currently you would only purchase the GPU with the VRAM needed for a task, and not split it up.
Armabelico
Two years from now: We are glad to present the new 7060, same performance as a 6000 Pro and 30 elephants and your mom. The more you buy, the more rich we are, and we are the Borg, your jobs will be trained and assimilated.... now with Crayola-generated, very noticeable fake frames and cooled by a hand-cranked fan, with only an 8 GW power consumption at 30 fps. Your cables will melt and you get to buy another one. Endivia: we don't give frames, we generate them.
gamersnexus
On the ML testing: I think it'd be better to keep the organization the same as you do with gaming charts, meaning different colors are different GPUs instead of different models.
If you decide to do more of these, please make sure to provide versioning info for the backend and apps used, and preferably model sources, as even LM Studio can just get any flavor off of Hugging Face and there are a lot of different models out there, as you probably know.
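One possible way to capture the versioning info suggested above is a small manifest saved next to each result. This is only a sketch: the field names are hypothetical, and the placeholder values must be filled in from the actual test setup.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class RunManifest:
    """Hypothetical record of what was actually run for one benchmark result."""
    app: str
    app_version: str
    backend: str
    backend_version: str
    model_source: str   # e.g. the repository the weights were downloaded from
    quantization: str
    gpu: str

# Placeholder values; fill these in from the real test system.
manifest = RunManifest(
    app="LM Studio",
    app_version="<fill in>",
    backend="<fill in>",
    backend_version="<fill in>",
    model_source="<fill in>",
    quantization="<fill in>",
    gpu="RTX PRO 6000 Blackwell",
)

print(json.dumps(asdict(manifest), indent=2))
```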
user-dh9mw8mh3t
For LLM testing you could simplify it and always run the same small model (7B or less). The speed increase between different cards will be roughly similar for all models as long as they fit in VRAM. That way you will avoid the trap of the model spilling into regular RAM. And if you do some research, you could just give the biggest possible model (probably at Q4) that each card can run at some arbitrary context length.
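A rough sketch of the "biggest model that fits" estimate. It assumes weight size is approximately parameters times bits per weight divided by 8; the 4.5 bits per weight for a Q4-style quantization, the 4 GB reserve for context and runtime overhead, and the list of model sizes are all assumptions for illustration, not measured values.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

def fits(params_billion, vram_gb, bits_per_weight=4.5, reserve_gb=4.0):
    """True if the weights plus an assumed reserve fit in the given VRAM."""
    return weights_gb(params_billion, bits_per_weight) + reserve_gb <= vram_gb

MODEL_SIZES_B = (7, 13, 32, 70, 120)  # hypothetical model sizes, in billions

for vram in (24, 32, 96):
    largest = max(p for p in MODEL_SIZES_B if fits(p, vram))
    print(f"{vram} GB VRAM -> up to about {largest}B parameters")
```

The real limit depends on context length and the specific quantization format, so results near the boundary should be checked by actually loading the model.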
vAnduCloudGaming
The price of all workstation Quadro 6000 cards has always been $10,000. Why is there so much talk about the price? Even now, the NVIDIA RTX 6000 Ada costs $10,000. The price hasn’t gone down.
The NVIDIA RTX 6000 Ada Generation GPU performance has been similar to the 4090, or about 9% higher. It has always been like this, mainly offering more VRAM and additional data center features like vGPU.