Help with DevIoDrv asynchronous operation (possible bugs)



#1 benrg
  • Members
  • 5 posts
  • United States

Posted 19 May 2022 - 06:54 AM

I'm trying to understand how to use the new DevIoDrv interface to make an asynchronous/overlapped server, and I'm running into problems (in my head).
 
I allocate and lock a buffer, and pass it to the driver with IOCTL_DEVIODRV_EXCHANGE_IO. When a client request comes in, the driver completes the ioctl. While I'm still working on that request, I send another buffer (or perhaps the same one) to the driver, so that if another request comes in, I can handle it before the first is complete. So far so good.
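
In code, the loop I have in mind looks roughly like this. Treat it as a sketch only: the ioctl codes are from the DevIoDrv headers, but passing the locked buffer itself as both input and output of the two ioctls, and LOCK_MEMORY staying pending for as long as the buffer is locked, are just my reading of it and may be wrong.

#include <windows.h>

#define BUF_SIZE (64 * 1024)

/* One buffer's worth of state: the memory, the long-lived OVERLAPPED for
   IOCTL_DEVIODRV_LOCK_MEMORY, and the OVERLAPPED for the current
   IOCTL_DEVIODRV_EXCHANGE_IO. */
struct io_buf {
    BYTE      *mem;
    OVERLAPPED lock_ov;
    OVERLAPPED xchg_ov;
};

static BOOL setup_buffer(HANDLE drv, struct io_buf *b)
{
    DWORD dummy;
    ZeroMemory(b, sizeof *b);
    b->mem = VirtualAlloc(NULL, BUF_SIZE, MEM_COMMIT, PAGE_READWRITE);
    b->lock_ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
    b->xchg_ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
    /* Lock the buffer with the driver; this call appears to stay pending
       for as long as the buffer remains locked. */
    if (DeviceIoControl(drv, IOCTL_DEVIODRV_LOCK_MEMORY,
                        b->mem, BUF_SIZE, b->mem, BUF_SIZE, &dummy, &b->lock_ov))
        return TRUE;
    return GetLastError() == ERROR_IO_PENDING;
}

static BOOL exchange(HANDLE drv, struct io_buf *b)
{
    DWORD dummy;
    /* Queue the buffer for the next client request (and, after the first
       round, complete the request currently sitting in it). The call
       completes when a client request has been placed in the buffer. */
    if (!DeviceIoControl(drv, IOCTL_DEVIODRV_EXCHANGE_IO,
                         b->mem, BUF_SIZE, b->mem, BUF_SIZE, &dummy, &b->xchg_ov)
        && GetLastError() != ERROR_IO_PENDING)
        return FALSE;
    return GetOverlappedResult(drv, &b->xchg_ov, &dummy, TRUE);
}

/* Per-buffer server loop:
       setup_buffer(drv, &b);
       while (exchange(drv, &b)) {
           decode the request in b.mem, serve it, put the response in b.mem;
           the next exchange() both completes it and re-queues the buffer,
           which is exactly where Problem #1 below comes from.
       }                                                                    */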
 
Problem #1: Say I finish the first request and no second request has come in. I have to use IOCTL_DEVIODRV_EXCHANGE_IO to complete the request, but that also queues the buffer for later client requests, so now there are 2 buffers stuck in kernel mode. Whenever a buffer returns to me, it comes with a request that I'll later have to complete, so it seems like the number of buffers can never decrease (except via STATUS_BUFFER_TOO_SMALL which is out of my control). If I serve 10 requests in parallel during a flurry of activity, I can't get rid of them when the system is idle because they're all owned by the driver.
 
It occurred to me that I might be able to simulate a "just complete, don't queue" call by canceling the IOCTL_DEVIODRV_EXCHANGE_IO when it returns with STATUS_PENDING, and I might be able to free the buffer if I also cancel the IOCTL_DEVIODRV_LOCK_MEMORY. But I've never canceled an IRP in my life, and I don't know whether it's safe. If you tell me it's safe and supported here, I'll do it.
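
Concretely, the user-mode side of that experiment would just be CancelIoEx on the two pending calls (reusing the io_buf sketch above), assuming the driver tolerates cancellation at all, which is exactly my question:

static void retire_buffer(HANDLE drv, struct io_buf *b)
{
    DWORD dummy;
    /* Cancel the EXCHANGE_IO that is pending as a "listen for the next
       request" call, so the buffer stops being offered to clients. */
    CancelIoEx(drv, &b->xchg_ov);
    GetOverlappedResult(drv, &b->xchg_ov, &dummy, TRUE);  /* reaps ERROR_OPERATION_ABORTED */
    /* Then cancel LOCK_MEMORY so the buffer itself can (hopefully) be freed. */
    CancelIoEx(drv, &b->lock_ov);
    GetOverlappedResult(drv, &b->lock_ov, &dummy, TRUE);
    VirtualFree(b->mem, 0, MEM_RELEASE);
}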
 
Problem #2: Suppose the first request was a 100MB read. I give the driver a 64K buffer to listen for more requests while handling the 100MB read, but none come in. I complete the 100MB read. Now a 100MB write request comes in. The driver seems to select a buffer in FIFO order with no regard for size, then complete with STATUS_BUFFER_TOO_SMALL if it's too small. So it will return my 64K buffer to me. Now the driver has a 100MB buffer of mine, and a request that could fit in it, but they're both queued and it isn't doing anything. I have to allocate another 100MB buffer and pass it in to get the request. I think this is a bug.
 
Problem #3: Is there any upper limit on the buffer size required? What if somebody mmaps a large file, issues a huge write request directly from it, and it's impossible to allocate enough RAM to serve it? I'm dead in the water: I can't fail the write, because to fail it I would first have to obtain it, and I can't handle any other requests, because the driver won't give them to me.
 
Problem #3½: I noticed the DevIoDrv code fails with STATUS_BUFFER_TOO_SMALL even on read requests, even though the read request would fit in the buffer; it's the response that doesn't. Since responses can use a different buffer, I'd rather just get the request back. The user-mode devio code has resizing logic for read requests and doesn't seem to expect the driver to do it, so I'm wondering if this is a bug.
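
For reference, the kind of resize handling I mean on the server side is roughly this. It is a sketch, not the actual devio code: required_size_from_request() and setup_buffer_sized() are hypothetical helpers standing in for however the needed size is read out of the returned request, and I'm assuming STATUS_BUFFER_TOO_SMALL surfaces as ERROR_INSUFFICIENT_BUFFER in Win32.

DWORD dummy;
if (!GetOverlappedResult(drv, &b->xchg_ov, &dummy, TRUE)
    && GetLastError() == ERROR_INSUFFICIENT_BUFFER)   /* STATUS_BUFFER_TOO_SMALL */
{
    /* The too-small buffer comes back to the server here. The resizing I
       would expect: find out how big the request is, set up a larger
       buffer, and carry on with that one. */
    SIZE_T needed = required_size_from_request(b->mem);   /* hypothetical helper */
    struct io_buf bigger;
    setup_buffer_sized(drv, &bigger, needed);             /* hypothetical sized variant */
    /* ...what can safely be done with the old buffer at this point is part
       of what I'm asking about... */
}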
 


#2 Olof Lagerkvist
  • Gold Member
  • Developer
  • 1448 posts
  • Location: Borås, Sweden
  • Sweden

Posted 19 May 2022 - 12:51 PM

Thanks for trying out this driver! It has not been used for much more than a few experiments, as far as I can tell.

 

It is basically designed around a fixed number of same-sized buffers. The implementation is not aware of cases where you might want large requests to take a different code path by supplying differently sized buffers or anything similar. It is optimized for all buffers being the same size.

 

There is no safe way to remove buffers while running, and there is no practical way to handle requests too large to fit in memory. I have seen requests as large as 10 MB or so on the latest versions of Windows running on machines with large amounts of free RAM. For versions older than Windows 10, I would say requests are only a few MB or so. If you are thinking of memory-mapped files in a file system on a virtual disk backed by ImDisk backed by deviodrv, there is never a scenario where it would try to page in more memory than could fit in memory. The page-in operations are not done in a way that causes such large requests.

 

For testing purposes, you could use an application that tries to map a view of an entire file at once. Notepad does that for example.



#3 benrg
  • Members
  • 5 posts
  • United States

Posted 19 May 2022 - 06:14 PM

Thanks for trying out this driver! It has not been used for much more than a few experiments, as far as I can tell.

 
It seems as though it could be extremely useful for many applications. Take a virtual disk backed by a file on a web server. This is nice for, e.g., extracting files from Microsoft ISOs without downloading the whole thing. But it's terribly slow even if you have a fast connection, because there is a separate round trip to the server for each read request. To get acceptable performance you have to implement your own prefetching logic, which is silly when NT should be doing that work for you. With asynchronous request handling you can just forward the requests to the server directly and get good performance.

I think these sorts of applications (with potentially high throughput limited by high latency) are common, so I'm surprised you haven't hyped this change more.
 

It is basically designed around a fixed number of same-sized buffers. The implementation is not aware of cases where you might want large requests to take a different code path by supplying differently sized buffers or anything similar. It is optimized for all buffers being the same size.

 
I would be happy to give it a fixed number of same-sized buffers, but the buffers have to be expanded to accommodate large requests, and the expanded buffers are stuck in kernel mode indefinitely at their expanded size until another request comes in. Even after I get them back, there's apparently no way to shrink them, because I can't instruct the driver to unlock them. So I'll inevitably end up with large, and possibly unevenly sized, buffers.

If somebody at any point does some 100MB writes, and I have four buffers so I can handle four requests at a time (I'd rather have more), I could end up with 400MB of wasted RAM for the lifetime of the server. 400MB of locked physical RAM unless I misunderstand what the LOCK ioctl does. Times however many virtual disks are mounted.

It would be easy to add an ioctl that would complete any passed-in request just like EXCHANGE_IO, then immediately unlock the buffer and complete the ioctl just like the STATUS_BUFFER_TOO_SMALL path. That would make it possible to shrink large buffers immediately after using them, which would help significantly.
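
Roughly what I have in mind, reusing the io_buf sketch from my first post and a completely made-up ioctl name (IOCTL_DEVIODRV_COMPLETE_AND_UNLOCK does not exist today):

DWORD returned;
DeviceIoControl(drv, IOCTL_DEVIODRV_COMPLETE_AND_UNLOCK,   /* hypothetical */
                b->mem, BUF_SIZE, NULL, 0, &returned, &b->xchg_ov);
GetOverlappedResult(drv, &b->xchg_ov, &returned, TRUE);
/* The driver would complete the client request held in b->mem exactly like
   EXCHANGE_IO, then unlock b->mem and complete this ioctl right away
   instead of queuing the buffer, so the server could free or shrink it. */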

It would be harder but much more robust to have a solution that wouldn't require allocating arbitrarily large amounts of RAM even temporarily, such as mapping large IRP buffers directly into user space (is that safe? it seems safe), or having separate ioctls to transfer hunks of data to/from the IRP buffers instead of memcpying it all at once into the server's buffer.

I appreciate your focus on speed, but I think you're looking at it the wrong way. The biggest potential speed gain from this protocol is from the inherent parallelism of it, not reduced ioctl overhead. Only small reads and writes need to be fast. Large reads and writes are rare, and just need to work robustly.
 

If you are thinking of memory-mapped files in a file system on a virtual disk backed by ImDisk backed by deviodrv, there is never a scenario where it would try to page in more memory than could fit in memory.

 
What I was thinking of is somebody mmapping a large file (on any disk, real or virtual), and then issuing a single write, to a different file on a virtual drive, whose source buffer is the mmapped region. The buffer is not in RAM, and isn't even pagefile-backed. But to handle that request with the current design, I have to create an equally large buffer that does live in RAM, which may be impossible.

Maybe that was an overly complicated example. Even writes from a RAM buffer can be gigabytes in size. That's bad when you consider that it can get multiplied by the number of parallel requests you want to handle, and then again by the number of virtual disks that are mounted. And all of that is nonpageable, unless I misunderstand.

I realize such writes are unusual, but I don't want to just hope that they won't happen; I want the design of the system to be such that there is a provable upper bound on the worst-case memory usage. All of the other proxy protocols have that property (if you use a fixed-size R/W buffer for the stream-based ones).



#4 Olof Lagerkvist
  • Gold Member
  • Developer
  • 1448 posts
  • Location: Borås, Sweden
  • Sweden

Posted 19 May 2022 - 06:25 PM

What I was thinking of is somebody mmapping a large file (on any disk, real or virtual), and then issuing a single write, to a different file on a virtual drive, whose source buffer is the mmapped region. The buffer is not in RAM, and isn't even pagefile-backed. But to handle that request with the current design, I have to create an equally large buffer that does live in RAM, which may be impossible.

Maybe that was an overly complicated example. Even writes from a RAM buffer can be gigabytes in size. That's bad when you consider that it can get multiplied by the number of parallel requests you want to handle, and then again by the number of virtual disks that are mounted. And all of that is nonpageable, unless I misunderstand.

I realize such writes are unusual, but I don't want to just hope that they won't happen; I want the design of the system to be such that there is a provable upper bound on the worst-case memory usage. All of the other proxy protocols have that property (if you use a fixed-size R/W buffer for the stream-based ones).


I am not sure how to explain it, but no, it does not work the way you think. If you memory-map a 100 MB file in a file system that resides on an ImDisk volume backed by DevIoDrv, and you decide to map a view of the entire file into memory, the page-in requests that fill that memory are never even close to 100 MB in size. Also, if you write to that memory, the page-out requests that write the data down to the disk will never be that large either.

 

With the other proxy protocols the buffer sizes are fixed, and requests will simply fail if they are too large. Also, many of them (and many of the intermediate filter drivers that may be present) need some kind of intermediate storage in non-paged pool memory to handle requests, and that can certainly never be several GB in size.



#5 Olof Lagerkvist
  • Gold Member
  • Developer
  • 1448 posts
  • Location: Borås, Sweden
  • Sweden

Posted 19 May 2022 - 07:18 PM

It seems as though it could be extremely useful for many applications. Take a virtual disk backed by a file on a web server. This is nice for, e.g., extracting files from Microsoft ISOs without downloading the whole thing. But it's terribly slow even if you have a fast connection, because there is a separate round trip to the server for each read request. To get acceptable performance you have to implement your own prefetching logic, which is silly when NT should be doing that work for you. With asynchronous request handling you can just forward the requests to the server directly and get good performance.


Asynchronous I/O would not help in cases like this. First of all, you would still need to wait for each round trip to the server for each I/O request. That is the slow part, not that requests need to wait for each other. The file system driver will not be able to issue a large number of I/O requests down to the disk driver in a way that would make any practical difference. You could easily see this by checking how many items end up in the queue for synchronous operation when using a synchronous backend for ISO images.
 

I think these sorts of applications (with potentially high throughput limited by high latency) are common, so I'm surprised you haven't hyped this change more.


The problem is rather that there are probably very few cases where this driver design would make any practical difference. It is a lot more complex to implement anything that uses this driver and the performance difference is not very big compared to the added complexity.



#6 benrg
  • Members
  • 5 posts
  • United States

Posted 19 May 2022 - 09:18 PM

Asynchronous I/O would not help in cases like this.


It helps. The HTTP-disk server I mentioned before isn't hypothetical; I really wrote it. It really is faster with prefetching (by a lot).

OS prefetching means some program requests the first 32K of a file, then the next 32K, etc., and the OS eventually notices that it's reading sequentially and starts issuing reads for the pieces it anticipates the application will ask for next, early enough that they'll be ready when it does ask for them. Application prefetching means the application issues four 32K reads into four buffers, and when the first one completes and it's done processing it, it issues the fifth read using that buffer, and so on.

Each read is small, and the total number of them in flight may be small. What makes them prefetches is the time at which they're issued.
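
For concreteness, the application-side pattern is just ordinary Win32 overlapped reads kept a few requests deep; something like the sketch below, where the sizes and depth are placeholders and the handle must have been opened with FILE_FLAG_OVERLAPPED:

#include <windows.h>

#define PIECE  (32 * 1024)
#define DEPTH  4

static void read_pipelined(HANDLE file)
{
    static BYTE bufs[DEPTH][PIECE];
    OVERLAPPED  ovs[DEPTH];
    LONGLONG    next = 0;
    DWORD       got;
    int         i;

    ZeroMemory(ovs, sizeof ovs);

    /* Prime the pipeline: DEPTH reads go out before any data is consumed. */
    for (i = 0; i < DEPTH; i++)
    {
        ovs[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
        ovs[i].Offset = (DWORD)next;
        ovs[i].OffsetHigh = (DWORD)(next >> 32);
        ReadFile(file, bufs[i], PIECE, NULL, &ovs[i]);  /* pends with ERROR_IO_PENDING */
        next += PIECE;
    }

    for (i = 0; ; i = (i + 1) % DEPTH)
    {
        /* Wait for the oldest read, consume it, then reuse its slot for the
           next piece. The later reads are already in flight; that is the
           prefetch, and it only pays off if the layers underneath accept
           more than one request at a time. */
        if (!GetOverlappedResult(file, &ovs[i], &got, TRUE) || got == 0)
            break;
        /* ... process bufs[i] ... */
        ovs[i].Offset = (DWORD)next;
        ovs[i].OffsetHigh = (DWORD)(next >> 32);
        ReadFile(file, bufs[i], PIECE, NULL, &ovs[i]);
        next += PIECE;
    }
}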

With a synchronous user-kernel interface and high latency, you get none of the benefit of the prefetching, because the timeline looks like this:

  _    _    _    _
 / \  / \  / \  / \
/   \/   \/   \/   \

where the devio server is at the bottom, the Internet server it talks to is at the top, and the horizontal lines are the time the Internet server needs to actually serve the requests.

The requests weren't issued at those times. They were probably issued so that they would complete like this:

  ____
 //XX\\
////\\\\

but the sending times in the first timeline represent the earliest time at which the devio server can see each request. It isn't told what the next request will be until it has completed the previous one, which it can't do until all of that request's data has come back over the wire.

The only way the devio server can get reasonable performance out of this interface is to guess what the next request will be, before it actually sees it – which is prefetching. The OS or application should be doing it, and they are doing it, but it has to be re-done.

This has bottlenecked the performance of ImDisk from the beginning, and I've always had it in the back of my mind to post about it here. I thought you never fixed it because it would be too much work, but maybe it was only because you didn't realize that fixing it would help performance.



#7 Olof Lagerkvist
  • Gold Member
  • Developer
  • 1448 posts
  • Location: Borås, Sweden
  • Sweden

Posted 19 May 2022 - 09:31 PM

You could of course try to experiment with this if you think that it would make a noticeable difference. When I implemented asynchronous support for the physical memory backend (the awealloc driver), there were some test cases that showed big differences in some disk benchmark tools, so I know it could happen.

 

The reason, though, why I have thought that it would not give much of a performance boost in most practical cases is that I have measured the queue depth in ImDisk when using synchronous proxy protocols. Even with high-latency backends, such as a devio service over TCP/IP, there are not very often that many items in the queue. In most cases just one, sometimes two. That is particularly true when reading and writing large files, memory-mapping files, etc. Therefore, my conclusion has always been that it is difficult to gain much performance in this way.





