Blog - Page 1

0

specint test qualification specint2K6 round between the two cpus in comparison

yoshi_fp36 posted on June 4, 2026, 7:30

"An ounce of honest data is worth a pound of marketing hype" - SPEC

The "marketing hype" in question:

Since I want to do a fair comparison between something that may resemble the C86-4G and an Intel processor, here are the details of how I would run it if I ever got the chance:

1. System configuration.
| Detail | System 1 | System 2 |
|---|---|---|
| Operating system | Fedora latest | Fedora latest |
| Compiler | GCC | GCC |
| CPU | Intel Core i7-14700K | Ryzen 9 3950X 2.8 GHz |
| SSD | 512GB Gen4x4 | 512GB Gen4x4 (same model as System 1) |
| RAM | 64GB DDR5-6000 CL28 | 64GB DDR4-3600 CL14 |
| Chipset | Z790 | X570 |
| GPU | IGP/Headless | IGP/Headless |
| TDP | 253W | 142W |

All processors are kept at stock settings and are configured so that no thermal throttling occurs during operation.

2. Testing procedure.

SPEC CPU 2006 will be the benchmark of choice. The testing procedure is as follows:

Compilation flags:

Optimization: -Ofast -fno-unsafe-math-optimizations -flto=<n_cores> -march=<march> -fomit-frame-pointer.

Portability: -DSPEC_CPU_LP64 -Wno-error=template-body (both base/peak).

The placeholders above are set as follows:

| Option | System 1 | System 2 |
|---|---|---|
| n_cores | 20 | 16 |
| n_threads | 28 | 32 |
| march | raptorlake | znver2 |

For PGO runs, PGO options are enabled as follows:

PASS1_CFLAGS = -fprofile-generate
PASS1_CXXFLAGS = -fprofile-generate
PASS1_FFLAGS = -fprofile-generate
PASS1_LDFLAGS = -fprofile-generate
PASS2_CFLAGS = -fprofile-use
PASS2_CXXFLAGS = -fprofile-use
PASS2_FFLAGS = -fprofile-use
PASS2_LDFLAGS = -fprofile-use

The bindN options (N < n_threads) are configured so that bindN for N < n_cores each map to a hardware thread on a different physical core, so the first n_cores copies never share a core.
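A minimal sketch of how such a binding order could be derived, assuming you already have a mapping from logical CPU id to physical core id (on Linux this can be read from `/sys/devices/system/cpu/cpuN/topology/core_id`). The function name `bind_order` and the example topology are hypothetical, not part of any SPEC tooling.

```python
def bind_order(cpu_to_core):
    """Order logical CPUs so the first entries each sit on a different
    physical core; SMT siblings are appended afterwards."""
    by_core = {}
    for cpu in sorted(cpu_to_core):
        by_core.setdefault(cpu_to_core[cpu], []).append(cpu)

    order = []
    depth = max(len(siblings) for siblings in by_core.values())
    for level in range(depth):
        # level 0 takes one thread per core, level 1 takes the SMT siblings
        for siblings in by_core.values():
            if level < len(siblings):
                order.append(siblings[level])
    return order


# Hypothetical layout: 2 cores with SMT (cpus 0-3) plus 2 cores without (cpus 4-5).
example = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 3}
print(bind_order(example))  # bind0..bind3 land on four different cores
```

Entry N of the returned list is the logical CPU that bindN would pin to.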

Now, the testing procedure starts:

First, for SPEC06 STint, the testing procedure is as follows:

Launch a base run and a base+PGO run (as peak), both configured as above, with 1 iteration.

The peak tuning is then chosen by checking, for each benchmark, whether PGO wins or loses on execution time. Once that PGO decision has been made, we do a full 3-iteration run and report the base/peak results.
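The win/loss step could be sketched like this. It assumes you have per-benchmark wall times from the 1-iteration base and base+PGO runs; `pick_peak`, the `min_gain` threshold, and the timings in the example are all made up for illustration and are not measurements.

```python
def pick_peak(base_times, pgo_times, min_gain=0.0):
    """Return 'pgo' or 'base' per benchmark, keeping PGO only where it
    is faster than base by more than min_gain (fractional speedup)."""
    choice = {}
    for bench, base in base_times.items():
        gain = base / pgo_times[bench] - 1.0
        choice[bench] = "pgo" if gain > min_gain else "base"
    return choice


# Made-up timings in seconds, purely to show the shape of the data.
base = {"400.perlbench": 200.0, "462.libquantum": 100.0}
pgo = {"400.perlbench": 180.0, "462.libquantum": 105.0}
print(pick_peak(base, pgo))
```

A small positive `min_gain` (say 0.01) would avoid flipping to PGO on run-to-run noise from a single iteration.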

For SPEC06 MTint, the testing procedure is as follows:

Launch a base run and a base+PGO run (as peak) as above with 1 iteration, setting the number of copies to n_threads.

Next, launch the same configurations with the number of copies set to n_cores instead of n_threads.

The peak tuning is then chosen by checking whether PGO and SMT win or lose on execution time. Once the PGO/SMT decision has been made, we do a full 3-iteration run and report the base/peak results.
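Since the n_threads and n_cores runs use different copy counts, raw times are not directly comparable, so this sketch compares throughput (copies per second) instead. That choice, the name `pick_rate_peak`, and the example timings are my assumptions for illustration.

```python
def pick_rate_peak(times):
    """times maps (tuning, copies) -> elapsed seconds for one benchmark.
    Returns the configuration with the highest copies-per-second."""
    return max(times, key=lambda cfg: cfg[1] / times[cfg])


# Made-up timings for one benchmark on a 16-core / 32-thread system.
example = {
    ("base", 32): 800.0,
    ("pgo", 32): 760.0,
    ("base", 16): 500.0,
    ("pgo", 16): 470.0,
}
print(pick_rate_peak(example))
```

Running this per benchmark gives the PGO on/off and SMT on/off choice for the peak configuration.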

All report files for the final 3-iteration runs are made public (PDF, HTML, raw, flags, etc.), and the results are aggregated into a table as follows:

| Benchmark | System 1 | System 2 |
|---|---|---|
| SPEC06 STint base | | |
| SPEC06 STint peak | | |
| SPEC06 MTint base | | |
| SPEC06 MTint peak | | |

One can also plot graphs based on the test results.
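For reference, a sketch of how each table cell is derived from per-benchmark results, assuming the usual SPEC CPU2006 convention: take the median of the 3 runs, divide the reference time by it (times the copy count for rate runs), then take the geometric mean across benchmarks. The helper names and the numbers in the example are placeholders, not real reference times or results.

```python
import math
from statistics import median


def spec_ratio(ref_seconds, run_seconds, copies=1):
    """Ratio for one benchmark: copies * reference time / median run time."""
    return copies * ref_seconds / median(run_seconds)


def overall(ratios):
    """Overall score: geometric mean of the per-benchmark ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Placeholder numbers only, to show the call shape.
ratios = [
    spec_ratio(1000.0, [100.0, 110.0, 90.0]),
    spec_ratio(2000.0, [50.0, 52.0, 49.0]),
]
print(round(overall(ratios), 2))
```

The official tools already do this when generating reports; the sketch is only useful for re-deriving or plotting numbers from the raw files.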

The steps above should be similar for SPEC06 ST/MTfp scores, but the optimization/portability options might need changes.


0

specint boost

yoshi_fp36 posted on June 4, 2026, 2:05

in internal testing skylake is already crushing 16 Specint2K6/GHz just by swapping the compiler used for 462.libquantum.

462.libquantum is in the same situation as 023.eqntott in Spec89: it's obsolete by modern standards.


-1

spacemit k3 clarifications

yoshi_fp36 posted on May 31, 2026, 15:29

technically a 16 core but it's not that great.

default: 8 cores, due to isa incompatibility, since the X100 implements Rv64H.

next, the a100 sucks for anything but feeding the npu, so the A100 is actually a 4 core programmable npu in disguise. even spacemit doesn't want you to run normal apps on the "AI cores".

10mb l2 cache, but that cache is already the LLC, and a 30 cycle L2 latency on X100 is subpar compared to eg. gracemont. vector throughput is also not impressive by any means.

"Quad issue out of order", yet it can't seem to reach A76 performance. 8 X100 cores sometimes have an edge over the Pi 5, and one would usually compare it to an RK3588, which kinda fits the description. to be honest the X200 is going to reach skylake performance, but it's 6-wide, and it's not even near X2 performance (which isn't even peak 6-wide technology...)


-5

the battle of the absolute giants

yoshi_fp36 posted on May 27, 2026, 9:44

as i have come to ts conclusion...

lets fight

on the left side

it's the greatest cpu money can buy

Intel(R) Xeon(R) Platinum 9282 Processor core ultra X9 388H ultra whatever i don't care. SHUT UP

right side

it's the new one in town, just woke up

rockchip... (?) rk3688 12-core

last time we decided to do a quick "Comparison" of X9 388h vs spacemit k3... you know how that went. so time to get something that's at least competent...

focusing on the memory side first. rk3688 comes out on top with lpddr6, but it's a tie if lp5 is used. i don't have high hopes for latency and cpu bandwidth though.

gpu: as this is the least competent part of the x9 388h, i'll pass. OH WAIT, You remember who I am right? XJXIIJWIEIQJIJJ!OFOOADR!! OF COURSE IM GONNA MAKE USELESS COMPARISONS... LETS GO BOISSS

x9 388 Gpu: mali magni >2.0 TFLOPS

Rk36XX GPU: INTEL ARC BXXX > 7.0 TFLOPS

RK36XX is straight up mogmaxxing x9 388 in raw compute alone and it can even play games... how wonderful

Lets focus on the main party: the cpu.

cpu demonstration table is down below.

| Core | X9 3838588248219321499533124759132H | Rockchjp RK36294994192491532042032425 | Rockchjp RK36294994192491532042032405 |
|---|---|---|---|
| Ultra performance mogmaxxing cores | 4X coigar Cove (4 cores), Intels latest specint2k6 architecture, > 2.2 specint 2K17/ghz, > 18 specint2K6GHz* (listen up, A730? you thought HIGH PERFORMANCE means catching up to Cortex-X2? LOLOLLLLL, you're still stuck in ancient clock speeds ) , P cores, born from lion cove, node shrink of lion cove., 5.1 Ghz | can't even add a c0rtex-x967 core just for the sake of it!!! why, rockcjp? you're getting absolutely outmogged by 3838588248219321499533124759132H... just do it rockchjp i don't believe in you. | now get out as this is the low end chip, i don't mind them. WRONG. DEAD WRONG. ONE SHOULD HAVE A C0RTEX X4. GET OUT. |
| Low performance efficiency cores | 4X darkmont lpe, node shrink of skymont. easily mogmaxxes the rk36294994192491532042032425 everyday... | 4x c1 nano/a530, arms latest e-waste architecture, > 0.8 specint 2K17/ghz, > 6 specint2K6GHz* (arm should be ASHAMED of selling this in the BIG 27) , up to 2.XX Ghz | 6X c1 nano/a530, arms latest e-waste architecture, > 0.8 specint 2K17/ghz, > 6 specint2K6GHz* (arm should be ASHAMED of selling this in the BIG 27) , up to 2.XX Ghz. R0CKCJP SHOULD BE ASHAMED OF SPAMMING SIX OF THESE E-WASTE CORES AS SOME FORM OF ANNOYANCE IN THEIR LATEST PRODUCTS. SHOWS HOW THEIR RELATIONSHIP WITH THE A5X LINES NEVER ENDED, THEY'RE JUST PUTTING A7X for SMOKES AND CAN'T BE COMPETENT ENOUGH TO PUT AN X-c0RE. |
| efficiency cores | 8X Darkmont, node shrink of skymont. , 3.8 GHz,Intels latest densemaxxing architecture, > 2.0 specint 2K17/ghz, > 16 specint2K6GHz*, new intels atom architecture, rk36XX gonna be watching... | 8X cortex a730/c1 pro/ whatever inconsistent arm naming they just invented 10 seconds ago (8 cores), arms latest e-core architecture, > 2.2 specint 2K17/ghz, > 18 specint2K6GHz* (SHUT UP, INTEL!!! YOU ARE NOT INVITED TO THE PARTY) , up to 3.XX Ghz (reclassified since this intels e core levels cores and not p, sorry rockcjp) | 4X cortex a730/c1 pro/ whatever inconsistent arm naming they just invented 10 seconds ago (4 cores), arms latest e-core architecture, > 2.2 specint 2K17/ghz, > 18 specint2K6GHz* (YOU THOUGHT YOUR "SKYMONT" TRASH WAS COOL, HUH?) , up to 3.XX Ghz (rockcjp is in 2022 with these technologies...) |

tbh rockcjp is producing these on like tsmc n6 so its a l for them... intels got a much better node...

conclusion: depends on your choice. if you want an arm sbc, maybe the RK36294994192491532042032425 is the choice that you may love. lol. x9 3838588248219321499533124759132h might not even be for you if you demand absolute mt performance. try something else like 270k or threadripper. good bye...

Bonus:

arm c1-ultra, arms latest "x-series" core architecture, > 3.1 specint2k17/ghz, > 25 specint2k6GHZ, arm is straight up mogging intel and amd in their own game, arms cooking rn, only losing to apple, sad for arm, arm still hasnt been able to keep up, and rockcjp would never integrate that to their socs


0

x9 388h vs spacemit k3 processor???

yoshi_fp36 posted on May 27, 2026, 6:31

unfair comparison of two 16 core processors...

lets go

Spacemit k3:

8X X100 (2 clusters), X100 > Specint2K6/ghz of 9.0 (this is so funny), X100 supports the latest features, double pumped 256b vectors (compare: gracemont), Up to 2.5 GHZ...

8XA100 ("AI cores"), 15 TOPS inbuilt NPU, Based on older X60 architecture > Specint2K6/GHZ of 5.0, not accessible in normal usage, Only for AI CORES!!!, BIIIIIIIG VECTORS, 1024B VECTORS, 2X256B FMA (INTEL, YOU MUST LISTEN TO THIS, X9 388H, THEY ARE CATCHING UP!!). 2.0 Gjz..... because GJZ IS OUR SURVIVOR

CACHE: 4MBX2 (X100)+1MB X2 (A100) (again, gracemont would like to have a talk, its so funny)

X9 388H:

Cluster 1 4X coigar Cove (4 cores), Intels latest specint2k6 architecture, > 2.2 specint 2K17/ghz, > 18 specint2K6GHz* (listen up, X100? you thought HIGH PERFORMANCE means catching up to A76? LOLOLLLLL) , P cores, born from lion cove, node shrink of lion cove., 5.1 Ghz

8X Darkmont, node shrink of skymont. , 3.8 GHz

4x Lpe, which is going to still be more powerful than 8X X100 combined. LollllllUlz... 3.7 Gjz...

CACHE: 18M+3M * 4(L2)+4M*3(L2)

UNDOUBTEDLY THE K3 WINS. IN EVERY SINGLE THING. IF YOU WERE COMPARING BIIIIIG VECTORS, THAT IS.

* pretty sure if you used intel cheesemaxxing compilerz then it can get > 22 Specint2K6/GHZ. Just for lulz...

and we don't talk about the Arc B3XX (censored) GPU. It's not a fair contest to compare that to the greatest matrix engine ever.

as such i've collected some more details... looks like 4 engines that can do 8x16x8 matrix multiplications in A100... we'll wait for the next generation with X86's ACE to "compete" with this.

(compete? well... AMX is already 16c/52c for a full 16x64x16 matrix multiplication. Its so unfunny that I, X86 himself, would consider competing to something that's obviously miles superior....)


0

intel lakefield vs 270K plus

yoshi_fp36 posted on May 27, 2026, 6:11

intel lakefield (10th gen)

power : 7W

cores: Intel core, Sunny cove (10.5th gen), 3.0 ghz, 4X Intel tremont (10th gen "Atom" line), 2.8 ghz, all core turbo 1.8 Ghz...

Memory: 8gb Lp4X

Cache: 4MB, 1.5MB shared across Tremont cluster, 512K For sunny cove

performance: Sunny cove : +60-70% ipc vs Tremont, Tremont: 1.14 SIR2K17 / GHZ ...

intel 270K plus (17th gen)

power : 253W

cores: Intel core, 8X Lion cove (16th gen), 5.4 ghz, 16X Intel Skymont (16th gen "Atom" line), 4.7 ghz, 5.5 ghz on favored cores

Memory: Up to 256G DDR5

Cache: 36MB, 3mb per lion cove core, 4mb per skymont cluster (40MB l2), 192K per lion cove core L1.5, ...

performance: Lion cove : +17% ipc vs Skymont, Skymont: 1.10 SIR2K26 / GHZ ...

conclusion: choose your sacrifice. there's no perfect cpu...


-11

i hate yoshi

yoshi_fp36 posted on May 24, 2026, 22:04

yoshi does not deserve to be a character on equal footing with humans


-10

100% rickroll no virus

yoshi_fp36 posted on May 18, 2026, 3:27

click here to get rickrolled 100% legit

click here to get coconut malled 0 virus, 0 strings, 0 requirements, 0 ads


0

testing

yoshi_fp36 posted on May 17, 2026, 10:31

GA102-200-KD-A1 @ 2.01 GHz:

  • FP64 TFLOPS: 546.7 GFLOPS (1:64)
  • FP64 WMMA TFLOPS: 546.7 GFLOPS (1:64)
  • Int64 TOPS: 5.831 TOPS (1:6)
  • Int32 TOPS: 17.50 TOPS (1:2)
  • FP32 TFLOPS: 34.99 TFLOPS (1:1)
  • FP16 TFLOPS: 34.99 TFLOPS (1:1)
  • TF32 WMMA TFLOPS: 34.99 TFLOPS (1:1)
  • FP16 WMMA (.f32) TFLOPS: 69.98 TFLOPS (2:1)
  • FP16 WMMA TFLOPS: 139.96 TFLOPS (4:1)
  • Int8 dot product TOPS: 69.98 TOPS (2:1)
  • Int8 WMMA TOPS: 279.92 TOPS (8:1)
  • Int4 WMMA TOPS: 559.84 TOPS (16:1)

Mali-G31 MP2 @ 650 MHz:

  • FP64 TFLOPS: 0.00 GFLOPS (1:~\infty~)
  • FP64 WMMA TFLOPS: 0.00 GFLOPS (1:~\infty~)
  • Int64 TOPS: 0.00 GOPS (1:~\infty~)
  • Int32 TOPS: 20.8 GOPS (1:1)
  • FP32 TFLOPS: 20.8 GFLOPS (1:1)
  • FP16 TFLOPS: 41.6 GFLOPS (2:1)
  • TF32 WMMA TFLOPS: 0.00 TFLOPS (1:~\infty~)
  • FP16 WMMA (.f32) TFLOPS: 0.00 TFLOPS (1:~\infty~)
  • FP16 WMMA TFLOPS: 0.00 TFLOPS (1:~\infty~)
  • Int8 dot product TOPS: 0.00 TOPS (1:~\infty~)
  • Int8 WMMA TOPS: 0.00 TOPS (1:~\infty~)
  • Int4 WMMA TOPS: 0.00 TOPS (1:~\infty~)

Mali-G31 MP2 @ 1.094 THz:

  • FP64 TFLOPS: 0.00 GFLOPS (1:~\infty~)
  • FP64 WMMA TFLOPS: 0.00 GFLOPS (1:~\infty~)
  • Int64 TOPS: 0.00 GOPS (1:~\infty~)
  • Int32 TOPS: 35.01 TOPS (1:1)
  • FP32 TFLOPS: 35.01 TFLOPS (1:1)
  • FP16 TFLOPS: 70.02 TFLOPS (2:1)
  • TF32 WMMA TFLOPS: 0.00 TFLOPS (1:~\infty~)
  • FP16 WMMA (.f32) TFLOPS: 0.00 TFLOPS (1:~\infty~)
  • FP16 WMMA TFLOPS: 0.00 TFLOPS (1:~\infty~)
  • Int8 dot product TOPS: 0.00 TOPS (1:~\infty~)
  • Int8 WMMA TOPS: 0.00 TOPS (1:~\infty~)
  • Int4 WMMA TOPS: 0.00 TOPS (1:~\infty~)

Conclusion: Mali-G31 MP2 wins!!*

*If you can clock it to over 1 trillion Hz as the above. YMMV.
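The numbers above follow from peak throughput = lanes x 2 (one FMA counts as two operations) x clock. A quick sketch that reproduces the FP32 figures and the clock the Mali would need; the lane counts are not stated in the post, I inferred them by dividing the listed throughput by 2 x clock, so treat them as assumptions.

```python
def peak_tflops(lanes, clock_hz, ops_per_lane=2):
    """Peak throughput in TFLOPS, counting an FMA as 2 operations."""
    return lanes * ops_per_lane * clock_hz / 1e12


def clock_to_match(target_tflops, lanes, ops_per_lane=2):
    """Clock in Hz needed to hit target_tflops with the given lane count."""
    return target_tflops * 1e12 / (lanes * ops_per_lane)


GA102_LANES = 8704  # assumed: 34.99 TFLOPS / (2 * 2.01 GHz)
MALI_LANES = 16     # assumed: 20.8 GFLOPS / (2 * 650 MHz)

print(peak_tflops(GA102_LANES, 2.01e9))        # GA102 FP32 at 2.01 GHz
print(peak_tflops(MALI_LANES, 650e6) * 1000)   # Mali FP32 in GFLOPS at 650 MHz
print(clock_to_match(35.01, MALI_LANES) / 1e12)  # required Mali clock in THz
```

The other rows are the FP32 figure scaled by the ratio in parentheses, e.g. FP64 at 1:64.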


-1

tuning test

yoshi_fp36 posted on May 15, 2026, 15:46

following the testing methodology mentioned in the previous blog, here's an example result:

CPU: R7 7840U

GPU: RTX 3080 @ PCIe Gen4x4 (this is a HUGE limitation, see below for why)

Model: Qwen3.6-35B-A3B-MXFP4_MOE.gguf, target ctx = 69632

| ub | ncm min | pp4k@d8k | tg128@d8k |
|---|---|---|---|
| 512 | 32 | 294.34 | 49.2 |
| 1024 | 33 | 498.29 | 48.41 |
| 1536 | 34 | 609.46 | 46.83 |
| 2048 | 35 | 809.02 (only 1.9X faster than 1660 Ti, slower than a 3060 on full bus!) | 46.52 |
| 2560 | 36 | 793.59 | 45.86 |
| 3072 | 37 | 785.74 | (inconclusive) |
| 3584 | 38 | (not tested) | (not tested) |
| 4096 | 39 | (not tested) | (not tested) |

As you can see, compared to ub=512, ub=2048 only loses about 5% of TG performance (49.2 to 46.52) for 2.7X the PP performance (294.34 to 809.02).

Peak power is only 210W and prompt processing only reaches about 800 t/s, even slower than 2xT4, while the card by itself should be way better. ub=2048 already saturates the (pretty slow) PCIe bus, and going higher does not help. The result should look a lot better given a proper link (4x the speed), since even a GTX 1660 Ti can already saturate this link.
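A small sketch that recomputes the trade-off straight from the table rows, so the ratios can be checked or redone for another baseline. The function name `tradeoff` is mine; the numbers are the measured rows from the table above.

```python
# ub -> (pp4k@d8k, tg128@d8k) in tokens/s, copied from the table
rows = {
    512: (294.34, 49.2),
    1024: (498.29, 48.41),
    1536: (609.46, 46.83),
    2048: (809.02, 46.52),
    2560: (793.59, 45.86),
}


def tradeoff(rows, baseline, candidate):
    """Return (PP speedup factor, fractional TG loss) of candidate vs baseline."""
    pp0, tg0 = rows[baseline]
    pp1, tg1 = rows[candidate]
    return pp1 / pp0, 1.0 - tg1 / tg0


speedup, loss = tradeoff(rows, 512, 2048)
print(f"{speedup:.2f}x PP, {loss * 100:.1f}% TG loss")
```

This prints the PP speedup and TG loss for ub=2048 relative to ub=512; swapping in 2560 shows PP dropping again once the link is saturated.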
