
Secret GPU: RTX 2080 in the RTX 2060 KO, Up to +47% Workstation Performance
Date: 2020-05-06
Comments and reviews: 10
Michael
Looking at the die organization, TU104 has 48 SMs in 6 GPCs (8 SMs each), while TU106 has 36 SMs in just 3 GPCs (12 each). In the 2060 KO and FE the SM count is reduced to 30. In the FE the obvious way to get there is to disable 2 SMs in each of the 3 GPCs (side note: Turing SMs are paired into TPCs). For TU104 the way to reduce SMs is less obvious. I believe (but am not certain) that a TPC can't be split, so you can't do the obvious thing and disable 3 SMs in each of the 6 GPCs. That leaves a choice: do you keep all 6 GPCs or not? If you keep 6 GPCs, half of them will have 6 SMs and the other half only 4. If you drop one GPC, the remaining 5 will have 6 SMs each.
The 2070 FE was specified with 40 SMs and either 5 or 6 GPCs, so it's possible that the 2060 KO is similarly ambiguous. GPUs are generally pretty good at dealing with those kinds of asymmetries, but I wonder whether the hypothetical 6-GPC 2060 KO, where some GPCs have only 4 SMs, would face a bigger hurdle. Not the asymmetry itself, but the fact that a 4-SM GPC has only half of the compute resources it was designed to support. In the 2070 FE there would never be fewer than 6 of 8 SMs per GPC, only 25% below nominal. A difference like that might even be beneficial, because some workloads might prefer their SMs fat and happy, feeding from extra TPC resources. But if only 4 of 8 SMs are enabled, you might spend less time thinking your SMs look fat and happy than thinking your GPCs look starved and sad.
Combined with the extra flexibility of being able to bin away an entire defective GPC, I suspect the 2060 KO is 5 GPCs with 6 SMs each. The target SM count is low enough that they can absorb at least one defective SM pair per GPC, and they can simply disable any GPC that has its own higher-level defect or more than one defective SM pair. A 5:3 advantage in GPC resources could benefit some workloads by 50%.
There's probably also an advantage in a GPC managing just 6 SMs rather than 10.
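To make the arithmetic in the comment above concrete, here is a minimal Python sketch that enumerates how 30 SMs could be spread over the GPCs of a TU104. It assumes, as the comment does without being certain, that SMs can only be disabled in TPC pairs. The function names `layouts` and `most_balanced`, and the rule that every active GPC keeps at least one TPC, are illustrative assumptions and not anything NVIDIA has documented.

```python
from itertools import combinations_with_replacement

SMS_PER_TPC = 2          # Turing pairs SMs into TPCs (per the comment)
TU104_TPCS_PER_GPC = 4   # 8 SMs per GPC / 2 SMs per TPC
TARGET_SMS = 30          # SM count of the 2060 KO


def layouts(active_gpcs, target_sms=TARGET_SMS, max_tpcs=TU104_TPCS_PER_GPC):
    """Return every way to spread target_sms over active_gpcs GPCs.

    Each layout is a tuple of SMs per GPC in descending order.
    Assumption: a TPC cannot be split, and each active GPC keeps >= 1 TPC.
    """
    target_tpcs, leftover = divmod(target_sms, SMS_PER_TPC)
    if leftover:
        return []
    found = []
    # Candidate TPC counts per GPC, from a full GPC (4 TPCs) down to 1 TPC.
    for combo in combinations_with_replacement(range(max_tpcs, 0, -1), active_gpcs):
        if sum(combo) == target_tpcs:
            found.append(tuple(tpcs * SMS_PER_TPC for tpcs in combo))
    return found


def most_balanced(active_gpcs):
    """Pick the layout with the smallest gap between fullest and emptiest GPC."""
    options = layouts(active_gpcs)
    if not options:
        return None
    return min(options, key=lambda sms: max(sms) - min(sms))


if __name__ == "__main__":
    for gpcs in (6, 5):
        print(gpcs, "GPCs ->", most_balanced(gpcs))
```

Under these assumptions the most even 6-GPC layout mixes 6-SM and 4-SM GPCs, while the 5-GPC layout is uniform at 6 SMs each, which matches the two options weighed in the comment. An even split of 5 SMs across 6 GPCs never appears, because 5 is not a whole number of TPCs.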
Xeno
Gamers Nexus, I posted a comment previously with my speculation that the KO card might have more than 30 SMs, with the extra SMs just having a reduced number of processing blocks (my old comment is still here on the comment page), and I wrote a short CUDA kernel to test this speculation. I currently only have regular 2060 cards and haven't been able to get a KO card yet, so I was wondering if you'd like to try the kernel on your KO card. If you compile it and run it via the command:
./test 0 30
it will run a kernel on the card that launches 30 thread blocks (the first parameter selects the device, so use 0 on a single-card system). The test kernel allocates 48KB of shared memory per thread block, which ensures that only one block is resident per SM at any given time. With 30 blocks launched it runs in around 2 secs on a typical 2060. If you then run it via the command:
./test 0 31
it will launch a kernel with 31 thread blocks, which on a regular 2060 (with only 30 SMs) runs in about 4 secs, because the very last block has to wait for an SM to free up before it can execute, since the 48KB shared memory allocation forces one block per SM. If my speculation on the KO card is correct, the same command with 31 thread blocks should still run in about 2 secs, since the speculation is that the KO card has 31 or more SMs. You can also query the device for its SM count, but I want to test programmatically for more than 30 SMs, to be sure the Nvidia device driver isn't reporting 30 SMs when the KO card really has more. (source code follows)
#include #include #include
#define SHRDIDXSZ 48*1024
#define OUTRLP 12*1024
__global__ void testKern(int *iarr)
{
__shared__ char shrdarr[SHRDIDXSZ];
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int sum=idx;
for (int j=0; j
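The timing argument above can be modelled without a GPU. The following Python sketch is not the commenter's kernel (which is cut off in the comment). It is a toy model that assumes each SM holds exactly one block at a time and that every block takes the same 2 seconds, the figure quoted for a regular 2060. The names `estimated_runtime` and `infer_sm_count` are hypothetical.

```python
import math


def estimated_runtime(blocks, sm_count, seconds_per_block=2.0, blocks_per_sm=1):
    """Estimate kernel time when blocks execute in waves across the SMs.

    Assumption: all blocks take seconds_per_block, and an SM can hold
    blocks_per_sm blocks at once (1 here, mirroring the 48KB shared
    memory trick described in the comment).
    """
    if blocks <= 0:
        return 0.0
    slots = sm_count * blocks_per_sm
    waves = math.ceil(blocks / slots)
    return waves * seconds_per_block


def infer_sm_count(measure, max_blocks=128):
    """Return the largest block count that still finishes in one wave.

    measure is any callable mapping a block count to a runtime in seconds;
    on real hardware it would wrap a timed kernel launch.
    """
    baseline = measure(1)
    for blocks in range(2, max_blocks + 1):
        if measure(blocks) > 1.5 * baseline:
            return blocks - 1
    return None


if __name__ == "__main__":
    for sms in (30, 34):
        print(sms, "SMs:",
              estimated_runtime(30, sms), "s for 30 blocks,",
              estimated_runtime(31, sms), "s for 31 blocks")
```

In this model a 30-SM part doubles its runtime at 31 blocks, while a part with more active SMs does not, which is the signature the proposed test looks for. Real runtimes would be noisier, so the 1.5x threshold in `infer_sm_count` is only a placeholder.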
MickyMouseLimited
There is no guarantee. I had an Nvidia 6800 that I purchased from the very first batch. The card came with a 6800 Ultra core with 16 pipes and 6 vertex generators, and all I had to do was unlock them with software. The card could clock almost up to the speed of the Ultra, but with only one extra 4-pin power connector it reached 90% of the Ultra's clock speed. Later a friend of mine purchased a 6800 GT from a new batch. The card was a higher model, but it did have defective pipes and vertex generators, and it was not as stable when overclocked. Please don't present things as facts when they are not always guaranteed. This is not good advice. The same is true of AMD. I have personal experience with the 2600 Pro. Manufacturers will always sell something at a specific price because they know the GPU has an issue. In my case I bought a 2600 Pro at a lower price than the 2600 XT, hoping I could overclock it. I did, and everything was good initially, but after 2 days it started to develop artifacts and I had to replace the card.
HotaruHino
So I'm going to put up my theory on why the TU104-150 performs better in these applications than the TU106-200 in the vanilla RTX 2060, even though both have the same SM count, same ROP count, and same memory interface performance. The TU104 organizes its SMs differently than the TU106: the TU104 has 8 SMs per GPC, the TU106 has 12 SMs per GPC. Each GPC has a raster engine in it. Both the TU104-150 and TU106-200 have 30 SMs. Some basic math will tell you that the two GPUs must have different GPC counts, with the TU104-150 having at least one extra GPC, which means it also has to have an extra raster engine. From what I gather, a lot of professional applications focus more on geometry detail than on pixel shading. Since raster engines translate triangles into pixels, it makes sense that the GPU with more raster engines performs better in applications that focus on geometry detail.
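The "basic math" in the comment above can be written out as a short Python sketch. It only computes the minimum number of GPCs needed to hold 30 active SMs on each die and assumes one raster engine per GPC, as the comment states. The helper name `min_gpcs` is made up for illustration, and the actual GPC count of either card is not confirmed here.

```python
import math


def min_gpcs(active_sms, sms_per_gpc):
    """Smallest number of GPCs that can hold active_sms SMs."""
    return math.ceil(active_sms / sms_per_gpc)


if __name__ == "__main__":
    tu104 = min_gpcs(30, 8)    # TU104: 8 SMs per GPC
    tu106 = min_gpcs(30, 12)   # TU106: 12 SMs per GPC
    # One raster engine per GPC (assumption taken from the comment).
    print("TU104-150 needs at least", tu104, "GPCs / raster engines")
    print("TU106-200 needs at least", tu106, "GPCs / raster engines")
```

These are lower bounds only: the TU104 die has 6 GPCs in total, so the KO could have more than the minimum enabled, whereas the TU106 has only 3 to begin with.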
Clyde
Good stuff. GPUs are my biggest downfall. I'm not a gamer, and 98% of everything revolves around just that. I'm in the process of a new build and have been out of this for a while. I will be using After Effects and Premiere Pro mostly. I'm going with (and I know you don't like it) a Gigabyte Z390 Pro, 64 gigs of RAM at 3200 MHz, and 3 Samsung M.2 NVMe drives: 2 are 500 gig and 1 is 1T. I went with the Samsung 970 EVO, and I'm putting in an M.2 PCIe expansion card. Leaning towards the i7 9700K. The GPU I was looking at was the RTX 2060 Super, but after watching this I'm just lost. I'm trying my best to understand these GPUs, and I figure 420 for one, even if it's overboard, is future proof for my needs. If you have time please help. Thank you Ryan
thechannelitrollwith
this is just what the doctor ordered. i have a 1080 ftw3 running right now in both blender and games, but the optix render engine makes any rtx card a powerhouse that punches double over previous high tier cards using CUDA. i was just gonna get a cheap one to tide myself over til ampere, since replacing my 1080 for gaming requires a 2070(S). if i just slot this thing in with my 1080 til ampere comes out i'll have ray tracing and optix covered and the normal gaming performance of my 1080. this is truly a godsend.
Dr
Gargling those nV nutters. Radeon VII garbage card, don't bother mentioning the workstation performance and it has a graphite pad the horror! Failed 2070 dies, sucking more power making more heat than actual 2060 dies, same gaming performance, BUT glorious uplift in workstation performance! Oh and no hate on graphite pads since 'Carbonaut' came out either? Uhh. Wait, what? You know, you got so upset about the nV shill accusations, maybe there was some truth. Accepting those freebies colouring your opinions? -_-
Riley
I know this is late, but for what it's worth, y'all should know that a recent update to VRay Next now supports RTX in its GPU renderer. It's an early foray for Chaos Group, but the improvements over CUDA are promising. Sadly they also announced discontinued support of OpenCL with the update but, from experience, I'm certain AMD pulled the plug on OpenCL way before Chaos Group did. This card will be a great value buy for any 3D productivity software that uses VRay, and that's a lot.
Gamers
This was extremely fun and I hope it gets some attention online: We spent days on this and it was really intriguing to work on and a great break from the usual reviews, sort of like a mystery. My hope is that some people deeper at NVIDIA see it and contact me to educate us on what's really happening here. We have confirmed the results with NVIDIA and EVGA, and now it's time to understand them. I've also reached out to David Kanter for assistance in learning why this happened.
TheSkepticSkwerl
Please start testing hashcat benchmarks. There is a pretty big market for hardware in the pen test community. If I can run 8 2060 KOs and get the same performance as a 2070 Super, I would buy different hardware. I would just do something like WPA2, NTLM, sha512crypt and maybe SHA1 passwords. It's a really easy test. Plus it might bring more attention to cyber security and help people realize they should stop using easy passwords.