ImDisk Toolkit: looking for benchmarks



#1 v77

    Silver Member

  • Team Reboot
  • 602 posts
  • France

Posted 20 July 2014 - 11:03 AM

I'm creating a new topic to ask whether someone would be kind enough to run some benchmarks of the dynamic ramdisks of ImDisk Toolkit (20140719), compared to a previous version (ideally 20140710).

On my desktop machine, all the results are better, from 2% to 50% higher according to CrystalDiskMark. I just ran a test on my laptop, and there it is rather the opposite, except in the last test.

I am thinking of undoing the last change, but I would first like to see other people's results.

Technically, I simply replaced the SetEvent and WaitForSingleObject API calls with SignalObjectAndWait, since Microsoft's documentation states:
"The SignalObjectAndWait function provides a more efficient way to signal one object and then wait on another compared to separate function calls such as SetEvent followed by WaitForSingleObject."

So, if you have a few minutes... Please post the text version of the CrystalDiskMark results (copied with Ctrl-C), since it includes additional information. As I was planning to use this API elsewhere (ProxyCrypt), this would be very helpful. Thanks in advance. :)



#2 v77

    Silver Member

  • Team Reboot
  • 602 posts
  • France

Posted 21 July 2014 - 06:15 PM

Here are the results I got on my laptop:

 

Before:

           Sequential Read :  1068.729 MB/s
          Sequential Write :   890.763 MB/s
         Random Read 512KB :  1119.178 MB/s
        Random Write 512KB :  1006.940 MB/s
    Random Read 4KB (QD=1) :   259.391 MB/s [ 63327.8 IOPS]
   Random Write 4KB (QD=1) :   254.353 MB/s [ 62097.9 IOPS]
   Random Read 4KB (QD=32) :   435.965 MB/s [106436.6 IOPS]
  Random Write 4KB (QD=32) :   438.298 MB/s [107006.3 IOPS]

 

With the new synchronization function:
           Sequential Read :  1027.512 MB/s
          Sequential Write :   866.735 MB/s
         Random Read 512KB :   997.396 MB/s
        Random Write 512KB :   924.086 MB/s
    Random Read 4KB (QD=1) :    47.988 MB/s [ 11715.8 IOPS]
   Random Write 4KB (QD=1) :    48.456 MB/s [ 11830.1 IOPS]
   Random Read 4KB (QD=32) :   900.596 MB/s [219872.0 IOPS]
  Random Write 4KB (QD=32) :   763.640 MB/s [186435.5 IOPS]

 

I just redid the test, and the results for 4KB (QD=1) are now 357 and 90 MB/s. In fact, they are very erratic.
However, getting a result up to 5x slower in some cases, even with good results for parallelized requests, is unacceptable in my opinion. That's why I have just reverted the last change.

These synchronization functions are really a performance bottleneck...



#3 Olof Lagerkvist

    Gold Member

  • Developer
  • 1448 posts
  • Location: Borås, Sweden

Posted 22 July 2014 - 11:20 AM

Interesting find. It would also be interesting to find out whether there would be any significant difference if the synchronization method were changed at the other end too, that is, in the imdisk.sys driver. There I use the kernel technique that corresponds to SignalObjectAndWait, and I have also used SignalObjectAndWait in all my user mode proxy services.
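For reference, the kernel-mode pattern in question looks roughly like this (a sketch only, with placeholder event names rather than the actual imdisk.sys code; KeSetEvent with Wait = TRUE tells the kernel that a wait call follows immediately, so the thread is not rescheduled in between):

#include <ntddk.h>

VOID SignalResponseAndWaitForRequest(PKEVENT ResponseEvent, PKEVENT RequestEvent)
{
    /* Wait = TRUE: a KeWaitXxx call follows immediately at the same IRQL. */
    KeSetEvent(ResponseEvent, IO_NO_INCREMENT, TRUE);
    KeWaitForSingleObject(RequestEvent, Executive, KernelMode, FALSE, NULL);
}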

 

I have never thought about trying something else, since I learned many years ago that unnecessary context switches are bad in so many ways and have to be avoided at any cost (or at least something in that direction). But now that you have tried signal/wait as two separate calls and found that to be faster in at least some scenarios, I am no longer sure what to think about it.

 

I am right now in the middle of developing a commercial vhd proxy service for ImDisk. But to avoid context switches and the extensive synchronization needs that follow, it is implemented as a kernel level driver instead of a user mode proxy service. An IRP call to another driver does not itself need a thread context switch or synchronization operations, which means that all vhd file work is done directly in the context of the ImDisk device worker thread. This looks really promising when we measure performance. So, I think we can say for sure that context switches are expensive; the question is how they can be tuned/tweaked/avoided/used correctly to make them as cheap as possible.
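As an illustration of why no extra synchronization is needed on that path, a read request to a lower driver can be built and sent in the caller's own thread context, along these lines (a generic sketch under the usual WDM rules, not the actual proxy code):

#include <ntddk.h>

NTSTATUS ReadFromLowerDriver(PDEVICE_OBJECT LowerDevice, PVOID Buffer,
                             ULONG Length, LARGE_INTEGER Offset)
{
    KEVENT event;
    IO_STATUS_BLOCK iosb;
    PIRP irp;
    NTSTATUS status;

    KeInitializeEvent(&event, NotificationEvent, FALSE);

    /* Build a synchronous read IRP aimed at the lower device object. */
    irp = IoBuildSynchronousFsdRequest(IRP_MJ_READ, LowerDevice, Buffer,
                                       Length, &Offset, &event, &iosb);
    if (irp == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    status = IoCallDriver(LowerDevice, irp);
    if (status == STATUS_PENDING)
    {
        /* Only if the lower driver completes asynchronously do we wait;
           the common synchronous path stays in this thread context. */
        KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
        status = iosb.Status;
    }
    return status;
}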

 



#4 Olof Lagerkvist

    Gold Member

  • Developer
  • 1448 posts
  • Location: Borås, Sweden

Posted 22 July 2014 - 01:05 PM

Another idea: have you tried implementing this using the TCP/IP communication mechanism instead of shared memory? It adds some more memory buffer copying back and forth, but it could, theoretically, be more efficient when it comes to synchronization, so the total performance cost is not necessarily higher. It could be worth trying.



#5 v77

    Silver Member

  • Team Reboot
  • 602 posts
  • France

Posted 22 July 2014 - 02:01 PM

A single copy of the data already has a huge performance cost for a ramdisk. So, even with faster synchronization, we would lose performance for large requests.

A solution might be to put several requests in the header of the shared buffer, if they are available and if their total size fits in the buffer, but it would require extra processing, so it would only be worthwhile if it can be implemented very efficiently. Moreover, this would likely break compatibility.
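To make the idea concrete, the header of the shared buffer could look something like this (a purely hypothetical layout, not the existing ImDisk proxy protocol):

#include <stdint.h>

#define MAX_BATCHED_REQUESTS 8  /* arbitrary limit for this sketch */

typedef struct {
    uint64_t offset;     /* byte offset on the virtual disk      */
    uint32_t length;     /* payload length of this request       */
    uint32_t is_write;   /* nonzero for a write, zero for a read */
} batched_request;

typedef struct {
    uint32_t        request_count;                  /* requests in this batch */
    batched_request requests[MAX_BATCHED_REQUESTS]; /* their headers          */
    /* the data of all requests follows contiguously and must fit
       in the remaining space of the shared buffer */
} shared_buffer_header;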



#6 Olof Lagerkvist

    Gold Member

  • Developer
  • 1448 posts
  • Location: Borås, Sweden

Posted 23 July 2014 - 07:26 PM

Additional buffer copying is not always as bad as it looks. Many performance tuning changes in ImDisk over the years have actually involved additional buffer copies, usually so that time critical parts of the driver can work on a non-pageable copy of the data instead of the original pageable buffer. Even with an additional buffer copy operation or two, this can actually mean less copying at the physical memory level, because the kernel can keep the non-pageable buffer in the same place during its lifetime.
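The general pattern is simply to copy the pageable data into non-paged pool before the time critical code touches it, roughly like this (a sketch, not the actual ImDisk code; the pool tag is arbitrary):

#include <ntddk.h>

#define COPY_TAG 'pocB'  /* arbitrary pool tag for this sketch */

PVOID CopyToNonPagedBuffer(const VOID *PageableData, SIZE_T Length)
{
    PVOID copy = ExAllocatePoolWithTag(NonPagedPool, Length, COPY_TAG);
    if (copy != NULL)
        RtlCopyMemory(copy, PageableData, Length);
    /* The caller frees the copy with ExFreePoolWithTag(copy, COPY_TAG). */
    return copy;
}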

 

The shared memory communication involves a relatively large shared memory block for the actual I/O data, and this shared memory is always pageable. TCP/IP communication, on the other hand, involves IRP communication with smaller blocks from kernel pool memory, both pageable and non-pageable. Those blocks are not as likely to move around in physical memory as a large shared memory block.

 

Note that "pageable" in this case does not necessarily mean to page out to disk, it could mean to copy pages internally in physical RAM to optimize memory usage and find larger free contiguous blocks, which the kernel does once in a while when needed.

 



#7 v77

    Silver Member

  • Team Reboot
  • 602 posts
  • France

Posted 29 October 2019 - 06:56 PM

This performance issue seems to have disappeared in Windows 10.
I cannot compare my new results directly since CrystalDiskMark has changed its tests, but except for one CrystalDiskMark test on my desktop (which showed only a very small slowdown), SignalObjectAndWait now provides better performance in every case.

I also tested NtSignalAndWaitForSingleObject, which is the counterpart of SignalObjectAndWait in the NT API, and as expected, the results are slightly better.
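For anyone who wants to try it: NtSignalAndWaitForSingleObject is exported by ntdll.dll but not declared in the usual SDK headers, so it has to be resolved at run time, roughly like this (a sketch of one possible way to call it, not the exact ProxyCrypt/ImDisk code):

#include <windows.h>
#include <winternl.h>

typedef NTSTATUS (NTAPI *NT_SIGNAL_AND_WAIT)(HANDLE SignalHandle,
                                             HANDLE WaitHandle,
                                             BOOLEAN Alertable,
                                             PLARGE_INTEGER Timeout);

static NT_SIGNAL_AND_WAIT pNtSignalAndWait;

BOOL init_nt_signal_and_wait(void)
{
    pNtSignalAndWait = (NT_SIGNAL_AND_WAIT)GetProcAddress(
        GetModuleHandleW(L"ntdll.dll"), "NtSignalAndWaitForSingleObject");
    return pNtSignalAndWait != NULL;
}

void signal_and_wait(HANDLE hToSignal, HANDLE hToWaitOn)
{
    /* A NULL timeout waits indefinitely, like INFINITE in the Win32 call. */
    pNtSignalAndWait(hToSignal, hToWaitOn, FALSE, NULL);
}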


  • Olof Lagerkvist likes this



