manhquan2013 commented on March 29, 2025, 12:41
Upvote please!
pt48583994 commented on September 22, 2025, 10:56
Yes, here's the GTX 1660 Ti.
The GTX 1660 Ti has 1536 CUDA cores on the Turing architecture, with a 192-bit memory bus. From what I see in benchmarks across multiple programs, the GTX 1660 Ti is on par with a 1070 or 1070 Ti. One specific example is llama.cpp, where the GTX 1660 Ti performs about the same as a 1070 Ti because the newer architecture benefits from stream-k work partitioning. It also has improved FP16, which gives a slight performance boost in the FA tile kernel. Comparing llama.cpp builds for Pascal and Turing on gpt-oss-20b, the Turing build performs much better at long context. Then again, if you run the llama.cpp build compiled for Pascal, it will obviously come out slower than a 1070 Ti, since the raw compute horsepower just isn't there.

There's also a need to skip the MMA code path, because the 1660 Ti has no MMA units. Nvidia added MMA emulation anyway, probably so applications don't crash [that's actually evil tbh]. The fake MMA makes llama.cpp run much slower by default, up to 5 times slower than peak performance, while an RTX 2070 gets about 2x better performance thanks to its real MMA units. The FP16 units also benefit other AI software, such as image upscalers and even Lc0.

The fake-MMA behavior is taken to another extreme on the GTX 1650 TU106, where tensor cores are detected but cuBLAS runs really slowly. For the record, cuBLAS on my GPU runs at about 0.7 TFLOPS in tensor-core mode and ~8 TFLOPS without it. It seems cuBLAS is picking the Volta kernels, which do work under fake MMA but with very poor efficiency. Comparing peak-to-actual ratios, the emulated MMA comes out to roughly 10 TOPS, far below the ~40 TOPS DP4A theoretical peak.
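For context on why the emulated path tends to get picked by default: the GTX 16-series reports the same compute capability (7.5) as the RTX 20-series, so any code that decides "has tensor cores" from the SM version alone can't tell a 1660 Ti (TU116) apart from a 2070 (TU106). Here's a minimal sketch of that naive check (not how llama.cpp actually dispatches, just the pitfall; the filename is made up):

```cpp
// naive_cc_check.cu -- query compute capability and apply the naive
// "Turing or newer = tensor cores" test. TU116 (GTX 1660 Ti, no tensor
// cores) and TU106 (RTX 2070, real tensor cores) both report SM 7.5.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: SM %d.%d\n", prop.name, prop.major, prop.minor);

    // Naive check: SM >= 7.5. A GTX 1660 Ti passes it just like an RTX 2070,
    // but only the 2070 has real MMA units -- the 1660 Ti would land on the
    // emulated path.
    bool naive_has_tensor_cores =
        (prop.major > 7) || (prop.major == 7 && prop.minor >= 5);
    printf("naive tensor-core check: %s\n",
           naive_has_tensor_cores ? "yes (possibly wrong!)" : "no");
    return 0;
}
```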
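On the 0.7 vs ~8 TFLOPS cuBLAS numbers, this is roughly how I'd measure it: time a large FP16 GEMM through cublasGemmEx and compute 2*m*n*k / time. Which kernels cuBLAS actually selects depends on the library version and math mode, so treat this as a rough probe rather than a definitive tensor-core on/off switch; the 4096 problem size is just an assumed example, and the inputs are left uninitialized since only the timing matters.

```cpp
// gemm_probe.cu -- rough FP16 GEMM throughput probe with cuBLAS.
// Build: nvcc gemm_probe.cu -lcublas -o gemm_probe
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cuda_fp16.h>

int main() {
    const int n = 4096;                       // assumed square problem size
    __half *A, *B, *C;
    cudaMalloc(&A, sizeof(__half) * n * n);   // contents left uninitialized:
    cudaMalloc(&B, sizeof(__half) * n * n);   // we only care about timing,
    cudaMalloc(&C, sizeof(__half) * n * n);   // not the numerical result

    cublasHandle_t h;
    cublasCreate(&h);
    // CUBLAS_DEFAULT_MATH lets cuBLAS pick tensor-core kernels if it thinks
    // the GPU has tensor cores; CUBLAS_PEDANTIC_MATH disallows them.
    cublasSetMathMode(h, CUBLAS_DEFAULT_MATH);

    __half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    // warm-up run, then a timed run
    cublasGemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                 A, CUDA_R_16F, n, B, CUDA_R_16F, n, &beta,
                 C, CUDA_R_16F, n, CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
    cudaEventRecord(t0);
    cublasGemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                 A, CUDA_R_16F, n, B, CUDA_R_16F, n, &beta,
                 C, CUDA_R_16F, n, CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    double tflops = 2.0 * n * n * n / (ms * 1e-3) / 1e12;
    printf("%dx%dx%d FP16 GEMM: %.2f ms, %.2f TFLOPS\n", n, n, n, ms, tflops);
    return 0;
}
```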
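And on the DP4A side, this is the kind of instruction the non-MMA integer path leans on: one __dp4a does four int8 multiplies plus an accumulate per thread per instruction, entirely on the regular CUDA cores. llama.cpp's actual quantized kernels are far more involved (tiling, stream-k, etc.); this is just a toy dot product to show the intrinsic.

```cpp
// dp4a_dot.cu -- toy int8 dot product using the DP4A intrinsic (SM 6.1+).
// Build: nvcc -arch=sm_61 dp4a_dot.cu -o dp4a_dot
// Each int packs four int8 values; __dp4a multiplies the four pairs and
// adds them into the accumulator in a single instruction.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dp4a_dot(const int* a, const int* b, int n, int* out) {
    int acc = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        acc = __dp4a(a[i], b[i], acc);   // 4x (int8 * int8) + acc
    }
    atomicAdd(out, acc);                 // crude reduction, fine for a toy
}

int main() {
    const int n = 1 << 20;               // 1M packed words = 4M int8 values
    int *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(int));
    cudaMallocManaged(&b, n * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < n; ++i) { a[i] = 0x01010101; b[i] = 0x02020202; }
    *out = 0;

    dp4a_dot<<<256, 256>>>(a, b, n, out);
    cudaDeviceSynchronize();
    printf("dot = %d (expected %d)\n", *out, n * 4 * 1 * 2);
    return 0;
}
```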