[RC5] CUDA clients need recompiling for newer GPUs

Mike Reed mikereedwt at gmail.com
Sat Jan 18 07:16:41 EST 2014


Hi,

This is excellent. :) Watch this space...

Kind regards,
Mike

On 18 January 2014 10:53, Greg Childers <jgchilders at mailaps.org> wrote:
> Hi,
>
> The newer CUDA sm_35 GPUs, such as the GTX Titan and Tesla K20/K40, have a
> funnel shift instruction, which can be used as a 32-bit rotate.  No code
> changes are necessary: when targeting sm_35, the compiler recognizes the two
> shifts and an OR as a rotate and does the right thing.  Here are the results
> of my testing on a Tesla K20.
>
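
For readers unfamiliar with the idiom, here is a minimal sketch in plain C of
the "two shifts and an OR" rotate, next to a software emulation of a funnel
shift left. The names rotl32 and funnel_shift_left are hypothetical and are
not taken from the dnetc source; the emulation follows the documented
behaviour of the CUDA __funnelshift_l intrinsic (concatenate hi:lo, shift left
by the count modulo 32, keep the upper 32 bits).

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits: two shifts and an OR.
 * Masking both counts with 31 keeps every shift in range, so n == 0
 * is well defined. Hypothetical helper, not from the dnetc source. */
static inline uint32_t rotl32(uint32_t x, uint32_t n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Software emulation of a funnel shift left: treat hi:lo as one
 * 64-bit value, shift left by n modulo 32, return the upper 32 bits.
 * Passing the same word as lo and hi yields a rotate. */
static inline uint32_t funnel_shift_left(uint32_t lo, uint32_t hi, uint32_t n)
{
    uint64_t wide = ((uint64_t)hi << 32) | lo;
    return (uint32_t)((wide << (n & 31u)) >> 32);
}
```

With these definitions, funnel_shift_left(x, x, n) and rotl32(x, n) agree for
every 32-bit x and n, which is why a compiler targeting hardware with a funnel
shift can replace the shift/shift/OR sequence with one instruction.
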
> Current client:
>
> dnetc v2.9108-517-CTR-10070312 for CUDA 3.1 on Linux (Linux 2.6.32-279.14.1
> ...
>
> Please provide the *entire* version descriptor when submitting bug reports.
>
> The distributed.net bug report pages are at http://bugs.distributed.net/
>
>
> [Jan 18 10:47:30 UTC] RC5-72: using core #0 (CUDA 1-pipe 64-thd).
> [Jan 18 10:47:37 UTC] RC5-72: Benchmark for core #0 (CUDA 1-pipe 64-thd)
>                       0.00:00:04.63 [937,151,615 keys/sec]
> [Jan 18 10:47:37 UTC] RC5-72: using core #1 (CUDA 1-pipe 128-thd).
> [Jan 18 10:47:44 UTC] RC5-72: Benchmark for core #1 (CUDA 1-pipe 128-thd)
>                       0.00:00:04.44 [976,631,543 keys/sec]
> [Jan 18 10:47:44 UTC] RC5-72: using core #2 (CUDA 1-pipe 256-thd).
> [Jan 18 10:47:50 UTC] RC5-72: Benchmark for core #2 (CUDA 1-pipe 256-thd)
>                       0.00:00:04.42 [981,795,666 keys/sec]
> [Jan 18 10:47:50 UTC] RC5-72: using core #3 (CUDA 2-pipe 64-thd).
> [Jan 18 10:47:57 UTC] RC5-72: Benchmark for core #3 (CUDA 2-pipe 64-thd)
>                       0.00:00:04.74 [914,549,002 keys/sec]
> [Jan 18 10:47:57 UTC] RC5-72: using core #4 (CUDA 2-pipe 128-thd).
> [Jan 18 10:48:04 UTC] RC5-72: Benchmark for core #4 (CUDA 2-pipe 128-thd)
>                       0.00:00:04.58 [948,142,344 keys/sec]
> [Jan 18 10:48:04 UTC] RC5-72: using core #5 (CUDA 2-pipe 256-thd).
> [Jan 18 10:48:10 UTC] RC5-72: Benchmark for core #5 (CUDA 2-pipe 256-thd)
>                       0.00:00:04.40 [985,609,646 keys/sec]
> [Jan 18 10:48:10 UTC] RC5-72: using core #6 (CUDA 4-pipe 64-thd).
> [Jan 18 10:48:17 UTC] RC5-72: Benchmark for core #6 (CUDA 4-pipe 64-thd)
>                       0.00:00:04.51 [962,502,541 keys/sec]
> [Jan 18 10:48:17 UTC] RC5-72: using core #7 (CUDA 4-pipe 128-thd).
> [Jan 18 10:48:23 UTC] RC5-72: Benchmark for core #7 (CUDA 4-pipe 128-thd)
>                       0.00:00:04.40 [986,996,612 keys/sec]
> [Jan 18 10:48:23 UTC] RC5-72: using core #8 (CUDA 4-pipe 256-thd).
> [Jan 18 10:48:30 UTC] RC5-72: Benchmark for core #8 (CUDA 4-pipe 256-thd)
>                       0.00:00:04.36 [995,250,114 keys/sec]
> [Jan 18 10:48:30 UTC] RC5-72: using core #9 (CUDA 1-pipe 64-thd busy wait).
> [Jan 18 10:48:36 UTC] RC5-72: Benchmark for core #9 (CUDA 1-pipe 64-thd busy wait)
>                       0.00:00:04.53 [957,125,645 keys/sec]
> [Jan 18 10:48:36 UTC] RC5-72: using core #10 (CUDA 1-pipe 64-thd sleep 100us).
> [Jan 18 10:48:43 UTC] RC5-72: Benchmark for core #10 (CUDA 1-pipe 64-thd sleep 100us)
>                       0.00:00:04.58 [947,487,144 keys/sec]
> [Jan 18 10:48:43 UTC] RC5-72: using core #11 (CUDA 1-pipe 64-thd sleep dynamic).
> [Jan 18 10:48:50 UTC] RC5-72: Benchmark for core #11 (CUDA 1-pipe 64-thd sleep dynamic)
>                       0.00:00:04.57 [949,532,041 keys/sec]
> [Jan 18 10:48:50 UTC] RC5-72 benchmark summary :
>                       Default core : #0 (CUDA 1-pipe 64-thd)
>                       Fastest core : #8 (CUDA 4-pipe 256-thd)
>
>
> Recompiled client:
>
> dnetc v2.9110-519-CTR-11072023 for CUDA on Linux (Linux 2.6.32-279.14.1.el6
> ...
>
> Please provide the *entire* version descriptor when submitting bug reports.
>
> The distributed.net bug report pages are at http://bugs.distributed.net/
>
>
> [Jan 18 10:50:19 UTC] RC5-72: using core #0 (CUDA 1-pipe 64-thd).
> [Jan 18 10:50:24 UTC] RC5-72: Benchmark for core #0 (CUDA 1-pipe 64-thd)
>                       0.00:00:03.19 [1,367,772,924 keys/sec]
> [Jan 18 10:50:24 UTC] RC5-72: using core #1 (CUDA 1-pipe 128-thd).
> [Jan 18 10:50:29 UTC] RC5-72: Benchmark for core #1 (CUDA 1-pipe 128-thd)
>                       0.00:00:03.12 [1,395,317,272 keys/sec]
> [Jan 18 10:50:29 UTC] RC5-72: using core #2 (CUDA 1-pipe 256-thd).
> [Jan 18 10:50:35 UTC] RC5-72: Benchmark for core #2 (CUDA 1-pipe 256-thd)
>                       0.00:00:03.09 [1,409,888,624 keys/sec]
> [Jan 18 10:50:35 UTC] RC5-72: using core #3 (CUDA 2-pipe 64-thd).
> [Jan 18 10:50:40 UTC] RC5-72: Benchmark for core #3 (CUDA 2-pipe 64-thd)
>                       0.00:00:03.15 [1,383,760,581 keys/sec]
> [Jan 18 10:50:40 UTC] RC5-72: using core #4 (CUDA 2-pipe 128-thd).
> [Jan 18 10:50:45 UTC] RC5-72: Benchmark for core #4 (CUDA 2-pipe 128-thd)
>                       0.00:00:03.04 [1,435,289,273 keys/sec]
> [Jan 18 10:50:45 UTC] RC5-72: using core #5 (CUDA 2-pipe 256-thd).
> [Jan 18 10:50:50 UTC] RC5-72: Benchmark for core #5 (CUDA 2-pipe 256-thd)
>                       0.00:00:03.00 [1,454,768,816 keys/sec]
> [Jan 18 10:50:50 UTC] RC5-72: using core #6 (CUDA 4-pipe 64-thd).
> [Jan 18 10:50:55 UTC] RC5-72: Benchmark for core #6 (CUDA 4-pipe 64-thd)
>                       0.00:00:03.11 [1,402,262,966 keys/sec]
> [Jan 18 10:50:55 UTC] RC5-72: using core #7 (CUDA 4-pipe 128-thd).
> [Jan 18 10:51:00 UTC] RC5-72: Benchmark for core #7 (CUDA 4-pipe 128-thd)
>                       0.00:00:03.08 [1,416,234,138 keys/sec]
> [Jan 18 10:51:00 UTC] RC5-72: using core #8 (CUDA 4-pipe 256-thd).
> [Jan 18 10:51:05 UTC] RC5-72: Benchmark for core #8 (CUDA 4-pipe 256-thd)
>                       0.00:00:03.05 [1,429,372,665 keys/sec]
> [Jan 18 10:51:05 UTC] RC5-72: using core #9 (CUDA 1-pipe 64-thd busy wait).
> [Jan 18 10:51:11 UTC] RC5-72: Benchmark for core #9 (CUDA 1-pipe 64-thd busy wait)
>                       0.00:00:03.14 [1,386,448,528 keys/sec]
> [Jan 18 10:51:11 UTC] RC5-72: using core #10 (CUDA 1-pipe 64-thd sleep 100us).
> [Jan 18 10:51:16 UTC] RC5-72: Benchmark for core #10 (CUDA 1-pipe 64-thd sleep 100us)
>                       0.00:00:03.29 [1,324,972,176 keys/sec]
> [Jan 18 10:51:16 UTC] RC5-72: using core #11 (CUDA 1-pipe 64-thd sleep dynamic).
> [Jan 18 10:51:21 UTC] RC5-72: Benchmark for core #11 (CUDA 1-pipe 64-thd sleep dynamic)
>                       0.00:00:03.18 [1,371,027,270 keys/sec]
> [Jan 18 10:51:21 UTC] RC5-72 benchmark summary :
>                       Default core : #0 (CUDA 1-pipe 64-thd)
>                       Fastest core : #5 (CUDA 2-pipe 256-thd)
>
>
> You can easily compile for all current architectures by removing the ptx
> and cubin lines and compiling with
>
> NVCC = /usr/local/cuda/bin/nvcc \
>     --generate-code arch=compute_10,code=sm_10 \
>     --generate-code arch=compute_20,code=sm_20 \
>     --generate-code arch=compute_30,code=sm_30 \
>     --generate-code arch=compute_35,code=sm_35
>
> Greg