Peakload benchmarks for operating systems

Dima · June 21, 2026, 11:44am

Peak-load benchmarks for operating systems: Linux runs over 5,000× slower than our virtual machines

We applied place-transition nets (PTNs) defined by System V semaphores (https://doi.org/10.1080/17445760.2026.2615010) to benchmark Linux (Ubuntu 24.04.4 LTS, kernel 6.17.0-35).

Using PTNs for matrix multiplication and arrays of concurrent multiplications, we compared Linux kernel performance with that of our virtual machines (https://doi.org/10.1080/17445760.2025.2490148).

A PTN executing 1,024 parallel multiplications of 6-bit data completed in 0.912 seconds on our VM, compared with 5,673.597 seconds on Linux running on the same hardware (AMD Ryzen 7 6800H @ 4.8 GHz, 32 GB RAM). The application contains 9,216 semaphores (places) and 8,192 processes (transitions).

Linux is therefore more than 5,000 times slower than our VM (5,673.597 s / 0.912 s ≈ 6,221). We believe this gap cannot be explained solely by system-call and context-switching overhead. Instead, it points to inefficiency in the System V semaphore implementation in the Linux kernel source file sem.c.

We are interested in collaborating on projects aimed at implementing semaphores with wait-for-all semantics in Linux, both at the kernel level for processes and as a runtime mechanism for fast, futex-like thread synchronization. While futex_waitv provides wait-for-any semantics, wait-for-all semantics could help eliminate many deadlocks caused by sequential resource acquisition.
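To make the wait-for-all idea concrete, here is a minimal sketch in C of the all-or-nothing behaviour that a single semop() call already provides for System V semaphores. The function name all_or_nothing_demo and the two-semaphore setup are assumptions made up for illustration; they are not taken from pvzm.c.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* On Linux the caller has to define union semun itself. */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Hypothetical demo: asks for two semaphores in ONE semop() call while
 * only the first is available. Returns the value of semaphore 0 after
 * the failed attempt, or -1 on a setup error. */
int all_or_nothing_demo(void)
{
    int id = semget(IPC_PRIVATE, 2, IPC_CREAT | 0600);
    if (id == -1)
        return -1;

    unsigned short init[2] = {1, 0};   /* sem 0 is free, sem 1 is taken */
    union semun arg;
    arg.array = init;
    if (semctl(id, 0, SETALL, arg) == -1) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }

    /* IPC_NOWAIT makes the call fail with EAGAIN instead of blocking,
     * so the demo cannot hang. */
    struct sembuf take_both[2] = {
        {.sem_num = 0, .sem_op = -1, .sem_flg = IPC_NOWAIT},
        {.sem_num = 1, .sem_op = -1, .sem_flg = IPC_NOWAIT},
    };
    int rc = semop(id, take_both, 2);
    int err = errno;                   /* save before further calls */

    int sem0 = semctl(id, 0, GETVAL);
    semctl(id, 0, IPC_RMID);           /* remove the semaphore set */

    assert(rc == -1 && err == EAGAIN); /* sem 1 was not available */
    return sem0;
}
```

The function should return 1: the call did not take semaphore 0 while it failed on semaphore 1. Acquiring the two semaphores with two separate calls would leave semaphore 0 held, which is the pattern behind many sequential-acquisition deadlocks.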

For modeling, we use Tina (the TIme petri Net Analyzer toolbox by LAAS/CNRS) as an IDE and generate large PTN models with our own toolchains. Models are exported through our NDRtoALL plugin as .h files for the PVZ machine, which is then recompiled and executed as a Linux application.

Our basic tools are available on GitHub:

hydn · June 21, 2026, 11:45am

@Dima Welcome to our forums. Thanks for sharing.

ericmarceau · June 21, 2026, 4:56pm

Is your project strictly focused on the “scheduling” part of the Linux kernel?

Or are you also looking at the actual “net” benefits, acknowledging that not all benefits can be realized because of context constraints?

I think that summary is comparing apples and oranges!

You need to provide some context parameters before putting out such extremely one-sided statements.

Is that a single computer setup or is it a network of peer computers sharing the workload?
How much memory is dedicated to the “engine” (the whole network) performing the work?
How many independent tasks are being performed simultaneously?
Are the tasks compute-oriented, memory-oriented, or I/O-oriented?
What is the intended market for such a compute paradigm: super-computing? multi-national multi-server computing? SME-server computing? desktop computing?

“Interesting” results … but … LOTS of unknowns!

Dima · June 21, 2026, 6:35pm

Thank you, dear Eric, for your interest in our R&D.

We use PTNs as a graphical language for concurrent programming, and we develop virtual machines for MCUs and GPUs to run our applications, with a view to implementation in dedicated hardware (chips). The Sleptsov and Salwicki amendments to Petri nets make them fast and massively parallel.

Since I recently proved that System V semaphores are equivalent to inhibitor Petri nets, we compile our nets into data for a program called pvzm.c, which uses semop() and fork() to run the net as a Linux application. Today, to supplement our toolchains, I uploaded benchmark nets and applications to my GitHub. You can run them on your own Linux machine. So at present we benchmark a single computer, though we can extend our benchmarks to clusters.

The point is that System V semaphores can be implemented much more efficiently, both in the kernel and partially in the runtime, like futexes. They resolve many deadlocks, and we can consider detecting deadlocks on the fly, say with a system application started by the kernel. We have also proposed extended greedy semaphores, which have many advantages, and we are considering implementing a corresponding system call, semopk().

We worked in the kernel before, when implementing our novel protocol stack e6: https://www.ietf.org/archive/id/draft-zaitsev-e6-network-00.txt

With warm regards,
Dima, https://dimazaitsev.github.io/

Dima · June 21, 2026, 6:38pm

Thank you for your kind welcome, dear Hydn – Dima

ericmarceau · June 21, 2026, 6:40pm

Still trying to wrap my head around what is done.

Are you creating this “scheduler” as a captive, single-task component scheduler, albeit for a massive, complex, and parallel task?

Dima · June 21, 2026, 6:47pm

It is an abstract machine based on a place-transition net. A place is represented by a semaphore. A transition is a child process, created by fork(), that keeps repeating a single semop() system call whose sops correspond to the transition’s incident arcs. Thus, we use a System V semaphore array as a computer, because the system is Turing-complete. The idea of our benchmarks is to use System V semaphores as a computer. And we can make this computer much faster – Dima
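A minimal sketch of this scheme in C, for a toy net with two places and one transition that moves tokens from p0 to p1. This is not pvzm.c; the function name run_tiny_net, the net itself, and the way the parent detects completion are assumptions chosen for illustration.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <unistd.h>

/* On Linux the caller has to define union semun itself. */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

enum { P0 = 0, P1 = 1, NPLACES = 2 };

/* Hypothetical toy net  p0 --1--> t --1--> p1  with `tokens` in p0.
 * Returns the final marking of p1, or -1 on error. */
int run_tiny_net(unsigned short tokens)
{
    assert(tokens <= 32767);           /* SEMVMX on Linux */

    int id = semget(IPC_PRIVATE, NPLACES, IPC_CREAT | 0600);
    if (id == -1)
        return -1;

    unsigned short marking[NPLACES] = {tokens, 0};
    union semun arg;
    arg.array = marking;
    if (semctl(id, 0, SETALL, arg) == -1) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }

    pid_t t = fork();
    if (t == -1) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }
    if (t == 0) {
        /* Transition t: one semop() whose sops are its incident arcs. */
        struct sembuf arcs[2] = {
            {.sem_num = P0, .sem_op = -1, .sem_flg = 0},  /* input arc  */
            {.sem_num = P1, .sem_op = +1, .sem_flg = 0},  /* output arc */
        };
        while (semop(id, arcs, 2) == 0)
            ;   /* each successful call is one firing; blocks when disabled */
        _exit(1);
    }

    /* Parent: Z operation, sleeps until p0 is empty. */
    struct sembuf wait_empty = {.sem_num = P0, .sem_op = 0, .sem_flg = 0};
    int rc = semop(id, &wait_empty, 1);

    kill(t, SIGTERM);                  /* the transition never stops itself */
    waitpid(t, NULL, 0);
    int p1 = (rc == 0) ? semctl(id, P1, GETVAL) : -1;
    semctl(id, 0, IPC_RMID);
    return p1;
}
```

Calling run_tiny_net(3) should return 3. Because each firing is a single atomic semop(), p1 already holds all the tokens at the moment p0 becomes empty.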

ericmarceau · June 21, 2026, 6:59pm

I’ve bookmarked this to revisit periodically.

I used semaphores at the “mini” scale, to avoid conflicts between 3 programs … way back!

I think I am good at abstract, but I’m afraid that this concept is too abstract for me, so I will leave it to others to give their interpretation regarding the intended “market” for this concept.

No intent to cast aspersions, but I’m out of my league here!

Dima · June 22, 2026, 3:37pm

Thank you for starting the discussion, dear Eric. Leaving the details aside, it is rather simple: System V semaphores with wait-for-all semantics and three operations, P, V, and Z, form a computer, like a Turing machine. We can compile any program into it and compute using the Linux kernel as a computer. This puts a peak workload on the kernel, which lets us evaluate its reliability and performance. The market is similar to that for measuring instruments: we measure which OS, kernel, or kernel module is better. That is, I mean using it as a benchmark; we can do much more, but that is another story – Dima

ugnvs · June 22, 2026, 5:10pm

I beg your pardon, but I cannot take these numbers for granted, for the simple reason that the top 500 supercomputers in the world are powered by Linux.

Dima · June 26, 2026, 5:09pm

I invite you to check it yourself, dear Eugene. I uploaded the benchmarks to my GitHub. You can compare the computation times on Linux and on our VM application. Note that the current teaser contains two benchmarks; the larger one runs for about 2 hours on my Linux machine. We plan to upload all model generators and complete toolchains after publishing our next research paper. – Dima

P.S. You will get tired of waiting for the larger benchmark to complete, which is good motivation to rewrite at least SVS (System V semaphores) efficiently; in the future, we dream of catching deadlocks and optimizing kernel work on the fly.

roman35 · June 28, 2026, 11:38am

@Dima
These are truly compelling results. It would be fascinating to explore whether Petri Nets could help identify and address inefficiencies in Linux’s parallelization mechanisms at a deeper level.

Bombilla · June 28, 2026, 12:08pm

Welcome to the community, @roman35. Nice to meet you!

roman35 · June 28, 2026, 12:30pm

Thank you, @Bombilla. Nice to meet you too!

Dima · June 28, 2026, 3:24pm

Thank you, Roman. The test reveals a lack of parallelization in the SVS implementation. A single kernel thread processes the sops items sequentially in two passes: (1) check and (2) apply or block. Then the sops of blocked semop() calls are rechecked, and those calls are possibly resumed. No magic, just lookup with two nested levels: quadratic time complexity with minor heuristics. I invite kernel programmers to join the PhD program at UoD, #1 university of the year in the UK, to address the issue and implement greedy semaphores. – Dima
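The two passes and the rescan can be modeled in a few lines of user-space C. This is a simplified sketch of the description above, not kernel code; the names op, call, try_ops, and rescan_blocked are hypothetical, and the model assumes that each place appears at most once per call.

```c
#include <assert.h>
#include <stddef.h>

/* One sops item: delta < 0 is P, delta > 0 is V, delta == 0 is Z. */
struct op {
    size_t place;
    int delta;
};

/* A pending semop()-like call and whether it is currently blocked. */
struct call {
    const struct op *ops;
    size_t nops;
    int blocked;
};

/* Two passes over one call. Returns 1 and updates `marking` if every
 * item can proceed; otherwise returns 0 and changes nothing. */
int try_ops(int *marking, size_t nplaces, const struct op *ops, size_t nops)
{
    for (size_t i = 0; i < nops; i++) {            /* pass 1: check */
        assert(ops[i].place < nplaces);
        int m = marking[ops[i].place];
        if (ops[i].delta < 0 && m < -ops[i].delta)
            return 0;                              /* not enough tokens */
        if (ops[i].delta == 0 && m != 0)
            return 0;                              /* place is not empty */
    }
    for (size_t i = 0; i < nops; i++)              /* pass 2: apply */
        marking[ops[i].place] += ops[i].delta;
    return 1;
}

/* Rescans every blocked call after the marking has changed. The loop
 * over calls here and the loop over items inside try_ops() are the two
 * nested levels. Returns the number of calls resumed. */
size_t rescan_blocked(int *marking, size_t nplaces,
                      struct call *calls, size_t ncalls)
{
    size_t resumed = 0;
    for (size_t i = 0; i < ncalls; i++) {
        if (calls[i].blocked &&
            try_ops(marking, nplaces, calls[i].ops, calls[i].nops)) {
            calls[i].blocked = 0;
            resumed++;
        }
    }
    return resumed;
}
```

In this model, every change to the marking costs one rescan proportional to the number of blocked calls times the number of items per call, which is where the nested lookup becomes expensive for thousands of transitions.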

PKChomiuk · July 8, 2026, 10:07am

Very interesting work.
As I understand it, both papers focus on improving the execution layer of concurrent computation: reducing synchronization overhead, accelerating transition selection, and exploiting massive GPU parallelism while preserving the computation semantics.
Reading your work raised an interesting question. Have you considered extending the transition-selection mechanism with a higher-level coherence criterion rather than relying solely on priority and activation conditions?
In a distributed cognitive architecture that I am currently developing (FEDI – Fion-Enhanced Distributed Intelligence), autonomous reasoning units validate candidate states through a multi-stage epistemic consistency process (geometric consistency, spectral coherence, scaling invariance, coherence integration, and falsifiability testing). This naturally leads to the question of whether a similar coherence metric could be incorporated into concurrent scheduling.
Instead of selecting only the first enabled transition, one could imagine selecting the enabled transition that maximizes predicted global coherence or minimizes future synchronization conflicts, while preserving correctness. In other words, optimization would move from purely execution efficiency toward execution quality under distributed consistency constraints.
I do not claim that this would improve your algorithms, but I think it could represent an interesting direction for future research, especially where concurrent computation and distributed AI architectures begin to overlap.
Thank you for sharing your work. I found it both technically interesting and thought-provoking.

Bombilla · July 8, 2026, 10:20am

Welcome to our community @PKChomiuk, thanks for joining us!

Dima · July 8, 2026, 10:49am

Thank you very much for your deep and inspiring comment, dear PK. It aligns well with our future plans, enriched by criteria characteristic of your exciting FEDI research.

For simple, straightforward applications, such as concurrent computing, we are interested in simple decision criteria. Recently, the Sleptsov–Salwicki rule — a combination of multiple and maximal firing strategies — has been one of our main focuses. Instead of firing a single transition, we fire as many transitions as possible in one step, each with its maximal firing multiplicity.
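As a rough illustration of the multiple-firing part of the rule, the sketch below computes the maximal firing multiplicity of a single transition as the minimum, over its input places, of the marking divided by the arc weight, and then fires it that many times in one step. The names arc, max_multiplicity, and fire_multiple are hypothetical, and resolving conflicts between several transitions that share input places is deliberately left out.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* A weighted arc between a place and the transition. */
struct arc {
    size_t place;
    int weight;
};

/* Largest k such that the transition can fire k times at once.
 * Assumption: the transition has at least one input arc. */
int max_multiplicity(const int *marking, const struct arc *inputs, size_t nin)
{
    assert(nin > 0);
    int k = INT_MAX;
    for (size_t i = 0; i < nin; i++) {
        assert(inputs[i].weight > 0);
        int c = marking[inputs[i].place] / inputs[i].weight;
        if (c < k)
            k = c;
    }
    return k;
}

/* Fires the transition k times in a single step. */
void fire_multiple(int *marking,
                   const struct arc *inputs, size_t nin,
                   const struct arc *outputs, size_t nout, int k)
{
    for (size_t i = 0; i < nin; i++)
        marking[inputs[i].place] -= k * inputs[i].weight;
    for (size_t i = 0; i < nout; i++)
        marking[outputs[i].place] += k * outputs[i].weight;
}
```

For example, with the marking {7, 4, 0}, input arcs of weight 2 from place 0 and weight 1 from place 1, and an output arc of weight 1 to place 2, the multiplicity is min(7/2, 4/1) = 3, and one step leaves the marking {1, 1, 3}.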

To allow for future applications involving optimisation as the most significant direction (why search for a good solution when we can aim for the best one?), we have defined a weaker variant of the transition firing rules. In the case of the Sleptsov–Salwicki rule, this means that any sub-multiset of the maximal multiset may fire in a single step; the choice of the sub-multiset is determined by an external optimisation criterion.

I will take a closer look at FEDI. It may be possible to specify a Fion Unit as a transition or as a generalised System V semaphore, allowing us to combine our approaches in joint research.

We are moving towards dedicated hardware design for PTNs defined by semaphores, with FPGA-based prototyping as the next step.

With warm wishes,
Dima

ericmarceau · July 9, 2026, 12:55am

Would I correctly interpret the described mechanism as requiring its own custom CPU instruction set, and therefore dictating its own unique chipset design? Or is the approach you are outlining one that can be implemented by “tweaking” RISC-V (RISC-V introduction by DigiKey) open-source hardware designs?

Dima · July 9, 2026, 7:14am

Dear Eric,

Thank you for your interesting point of view and for connecting neighboring concepts.

At present, we treat System V semaphores as an abstract computer because they are Turing-complete. We use them as a virtual machine, which I call the PVZ machine, after Dijkstra’s P and V semaphore operations, supplemented with the waiting-for-zero operation (Z).
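In terms of the semop() interface, the three operations differ only in the sign of sem_op. The helper names sop_P, sop_V, and sop_Z below are hypothetical, and reading Z as an inhibitor arc is an assumption based on the equivalence with inhibitor Petri nets mentioned earlier in the thread.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* P: take w tokens from a place (input arc); blocks until available. */
struct sembuf sop_P(unsigned short place, short w)
{
    assert(w > 0);
    return (struct sembuf){.sem_num = place, .sem_op = (short)-w, .sem_flg = 0};
}

/* V: put w tokens into a place (output arc); never blocks. */
struct sembuf sop_V(unsigned short place, short w)
{
    assert(w > 0);
    return (struct sembuf){.sem_num = place, .sem_op = w, .sem_flg = 0};
}

/* Z: wait until the place is empty (assumed here to model an
 * inhibitor arc); does not change the marking. */
struct sembuf sop_Z(unsigned short place)
{
    return (struct sembuf){.sem_num = place, .sem_op = 0, .sem_flg = 0};
}
```

A transition is then just an array of such items passed to one semop() call, for example { sop_P(0, 1), sop_Z(2), sop_V(1, 1) } for a transition that takes a token from place 0 and puts one into place 1 only while place 2 is empty.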

We transform the graphical language of place-transition nets, created with the graphical editor nd of Tina, into a Linux application using our NDRtoALL plugin. The plugin generates a .h file that is included in the PVZ machine (pvzm.c), which is then compiled with gcc.

You can compile and run pvzm.c yourself from the https://github.com/dimazaitsev/SNCtools/tree/main/bm5000x directory in the GitHub repository by editing the #include directive to select one of the pre-generated .h files. Please note that the largest example, gm32x32-b6.h, takes about two hours to run because it performs 32 × 32 = 1024 slow multiplications, similar to those described in gm6x6-a4.pdf.

Enjoy the run!

In fact, we only need fork() and semop(), and we are now moving toward prototyping a novel chipset on an FPGA. The architecture consists solely of child processes, each repeatedly calling semop(). Whenever a process is blocked, it simply idles, saving energy.

This forms a computing-memory structure without the traditional processor-memory bottleneck. Its entire instruction set is represented by the semaphore operations (sops) passed to semop(), which we visualize as a graph for clarity.

Yours, Dima
