# Asynchronous & Direct IO

## Motivation

### Why Asynchronous IO

Until the introduction of asynchronous IO, postgres relied on the operating
system to hide the cost of synchronous IO from postgres. While this worked
surprisingly well in a lot of workloads, it does not do as good a job on
prefetching and controlled writeback as we would like.

There are important expensive operations like `fdatasync()` where the
operating system cannot hide the storage latency. This is particularly
important for WAL writes, where the ability to asynchronously issue
`fdatasync()` or O_DSYNC writes can yield significantly higher throughput.


### Why Direct / unbuffered IO

The main reasons to want to use Direct IO are:

- Lower CPU usage / higher throughput. Particularly on modern storage,
  buffered writes are bottlenecked by the operating system having to copy data
  from the kernel's page cache to the postgres buffer pool using the CPU,
  whereas direct IO can often move the data directly between the storage
  devices and postgres' buffer cache using DMA. While that transfer is
  ongoing, the CPU is free to perform other work.
- Reduced latency - Direct IO can have substantially lower latency than
  buffered IO, which can be impactful for OLTP workloads bottlenecked by WAL
  write latency.
- Avoiding double buffering between the operating system cache and postgres'
  shared_buffers.
- Better control over the timing and pace of dirty data writeback.


The main reasons *not* to use Direct IO are:

- Without AIO, Direct IO is unusably slow for most purposes.
- Even with AIO, many parts of postgres need to be modified to perform
  explicit prefetching.
- In situations where shared_buffers cannot be set appropriately large,
  e.g. because there are many different postgres instances hosted on shared
  hardware, performance will often be worse than when using buffered IO.


## AIO Usage Example

In many cases code that can benefit from AIO does not directly have to
interact with the AIO interface, but can use AIO via higher-level
abstractions. See [Helpers](#helpers).

In this example, a buffer will be read into shared buffers.

```C
/*
 * Result of the operation, only to be accessed in this backend.
 */
PgAioReturn ioret;

/*
 * Acquire an AIO Handle, ioret will get the result upon completion.
 *
 * Note that ioret needs to stay alive until the IO completes or
 * CurrentResourceOwner is released (i.e. an error is thrown).
 */
PgAioHandle *ioh = pgaio_io_acquire(CurrentResourceOwner, &ioret);

/*
 * Reference that can be used to wait for the IO we initiate below. This
 * reference can reside in local or shared memory and be waited upon by any
 * process. An arbitrary number of references can be made for each IO.
 */
PgAioWaitRef iow;

pgaio_io_get_wref(ioh, &iow);

/*
 * Arrange for shared buffer completion callbacks to be called upon completion
 * of the IO. This callback will update the buffer descriptors associated with
 * the AioHandle, which e.g. allows other backends to access the buffer.
 *
 * A callback can be passed a small bit of data, e.g. to indicate whether to
 * zero a buffer if it is invalid.
 *
 * Multiple completion callbacks can be registered for each handle.
 */
pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV, 0);

/*
 * The completion callback needs to know which buffers to update when the IO
 * completes. As the AIO subsystem does not know about buffers, we have to
 * associate this information with the AioHandle, for use by the completion
 * callback registered above.
 *
 * In this example we're reading only a single buffer, hence the 1.
 */
pgaio_io_set_handle_data_32(ioh, (uint32 *) buffer, 1);

/*
 * Pass the AIO handle to a lower-level function. When operating on the level
 * of buffers, we don't know how exactly the IO is performed; that is the
 * responsibility of the storage manager implementation.
 *
 * E.g. md.c needs to translate block numbers into offsets in segments.
 *
 * Once the IO handle has been handed off to smgrstartreadv(), it may not be
 * used any further, as the IO may immediately get executed below
 * smgrstartreadv() and the handle reused for another IO.
 *
 * To issue multiple IOs in an efficient way, a caller can call
 * pgaio_enter_batchmode() before starting multiple IOs, and end that batch
 * with pgaio_exit_batchmode(). Note that one needs to be careful while there
 * may be unsubmitted IOs, as another backend may need to wait for one of the
 * unsubmitted IOs. If this backend then had to wait for the other backend,
 * it'd end in an undetected deadlock. See pgaio_enter_batchmode() for more
 * details.
 *
 * Note that even while in batchmode an IO might get submitted immediately,
 * e.g. due to reaching a limit on the number of unsubmitted IOs, and even
 * complete before smgrstartreadv() returns.
 */
smgrstartreadv(ioh, operation->smgr, forknum, blkno,
               BufferGetBlock(buffer), 1);

/*
 * To benefit from AIO, it is beneficial to perform other work, including
 * submitting other IOs, before waiting for the IO to complete. Otherwise
 * we could just have used synchronous, blocking IO.
 */
perform_other_work();

/*
 * We did some other work and now need the IO operation to have completed to
 * continue.
 */
pgaio_wref_wait(&iow);

/*
 * At this point the IO has completed. We do not yet know whether it succeeded
 * or failed, however. The buffer's state has been updated, which allows other
 * backends to use the buffer (if the IO succeeded), or retry the IO (if it
 * failed).
 *
 * Note that in case the IO has failed, a LOG message may have been emitted,
 * but no ERROR has been raised. This is crucial, as another backend waiting
 * for this IO should not see an ERROR.
 *
 * To check whether the operation succeeded, and to raise an ERROR (or, if
 * more appropriate, a LOG), the PgAioReturn we passed to pgaio_io_acquire()
 * is used.
 */
if (ioret.result.status == PGAIO_RS_ERROR)
    pgaio_result_report(ioret.result, &ioret.target_data, ERROR);

/*
 * Besides having succeeded completely, the IO could also have a) partially
 * completed or b) succeeded with a warning (e.g. due to zero_damaged_pages).
 * If we e.g. tried to read many blocks at once, the read might have
 * only succeeded for the first few blocks.
 *
 * If the IO partially succeeded and this backend needs all blocks to have
 * completed, this backend needs to reissue the IO for the remaining buffers.
 * The AIO subsystem cannot handle this retry transparently.
 *
 * As this example is already long, and we only read a single block, we'll just
 * error out if there's a partial read or a warning.
 */
if (ioret.result.status != PGAIO_RS_OK)
    pgaio_result_report(ioret.result, &ioret.target_data, ERROR);

/*
 * The IO succeeded, so we can use the buffer now.
 */
```
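
The batching mentioned in the comments above can be sketched as follows. This
is a hypothetical illustration, not code from the tree: the `iorets`, `iows`
and `buffers` arrays and the loop bounds are made up, and a real caller must
respect the deadlock caveats that apply while IOs are unsubmitted.

```C
/*
 * Hypothetical sketch of issuing several reads as one batch. The arrays
 * (iorets, iows, buffers) and the loop bounds are made up; the AIO calls are
 * the same ones used in the example above.
 */
pgaio_enter_batchmode();

for (int i = 0; i < nblocks; i++)
{
    PgAioHandle *ioh = pgaio_io_acquire(CurrentResourceOwner, &iorets[i]);

    pgaio_io_get_wref(ioh, &iows[i]);
    pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV, 0);
    pgaio_io_set_handle_data_32(ioh, (uint32 *) &buffers[i], 1);

    /* Each handle is fully defined before the next one is acquired. */
    smgrstartreadv(ioh, operation->smgr, forknum, blkno + i,
                   BufferGetBlock(buffers[i]), 1);
}

/*
 * Ends the batch; any still-unsubmitted IOs are submitted. Until this point
 * the backend must not block, per the deadlock caveats above.
 */
pgaio_exit_batchmode();

/* Later: wait for whichever IO is needed first. */
pgaio_wref_wait(&iows[0]);
```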


## Design Criteria & Motivation

### Deadlock and Starvation Dangers due to AIO

Using AIO in a naive way can easily lead to deadlocks in an environment where
the source/target of AIO are shared resources, like pages in postgres'
shared_buffers.

Consider one backend performing readahead on a table, initiating IO for a
number of buffers ahead of the current "scan position". If that backend then
performs some operation that blocks, or even just is slow, the IO completion
for the asynchronously initiated read may not be processed.

This AIO implementation solves this problem by requiring that AIO methods
either allow AIO completions to be processed by any backend in the system
(e.g. io_uring), or guarantee that AIO processing will happen even when the
issuing backend is blocked (e.g. worker mode, which offloads completion
processing to the AIO workers).


### IO can be started in critical sections

Using AIO for WAL writes can reduce the overhead of WAL logging substantially:

- AIO allows WAL writes to be started eagerly, so they complete before they
  need to be waited on
- AIO allows multiple WAL flushes to be in progress at the same time
- AIO makes it more realistic to use O\_DIRECT + O\_DSYNC, which can reduce
  the number of roundtrips to storage on some OSs and storage HW (buffered IO
  and direct IO without O_DSYNC need to issue a write and, after the write's
  completion, a cache flush, whereas O\_DIRECT + O\_DSYNC can use a single
  Force Unit Access (FUA) write).

The need to be able to execute IO in critical sections has substantial design
implications for the AIO subsystem. Mainly because completing IOs (see the
prior section) needs to be possible within a critical section, even if the
to-be-completed IO itself was not issued in a critical section. Consider
e.g. the case of a backend first starting a number of writes from shared
buffers and then starting to flush the WAL. Because only a limited amount of
IO can be in progress at the same time, initiating IO for flushing the WAL may
require first completing IO that was started earlier.


### State for AIO needs to live in shared memory

Because postgres uses a process model, and because AIOs need to be
complete-able by any backend, much of the state of the AIO subsystem needs to
live in shared memory.

In an `EXEC_BACKEND` build, a backend's executable code and other process
local state is not necessarily mapped to the same addresses in each process
due to ASLR. This means that the shared memory cannot contain pointers to
callbacks.


## Design of the AIO Subsystem


### AIO Methods

To achieve portability and performance, multiple methods of performing AIO are
implemented and others are likely worth adding in the future.


#### Synchronous Mode

`io_method=sync` does not actually perform AIO, but allows the AIO API to be
used while performing synchronous IO. This can be useful for debugging. The
code for the synchronous mode is also used as a fallback, e.g. the
[worker mode](#worker) uses it to execute IO that cannot be executed by
workers.


#### Worker

`io_method=worker` is available on every platform postgres runs on, and
implements asynchronous IO - from the view of the issuing process - by
dispatching the IO to one of several worker processes performing the IO in a
synchronous manner.


#### io_uring

`io_method=io_uring` is available on Linux 5.1+. In contrast to worker mode it
dispatches all IO from within the process, lowering context switch rate /
latency.


### AIO Handles

The central API piece of postgres' AIO abstraction is the AIO Handle. To
execute an IO one first has to acquire an IO handle (`pgaio_io_acquire()`) and
then "define" it, i.e. associate an IO operation with the handle.

Often AIO handles are acquired on a higher level and then passed to a lower
level to be fully defined. E.g., for IO to/from shared buffers, bufmgr.c
routines acquire the handle, which is then passed through smgr.c and md.c to
finally be fully defined in fd.c.

The functions used at the lowest level to define the operation are
`pgaio_io_start_*()`.

Because acquisition of an IO handle
[must always succeed](#io-can-be-started-in-critical-sections)
and the number of AIO Handles
[has to be limited](#state-for-aio-needs-to-live-in-shared-memory),
AIO handles can be reused as soon as they have completed. Obviously code needs
to be able to react to IO completion. State can be updated using
[AIO Completion callbacks](#aio-callbacks)
and the issuing backend can provide a backend local variable to receive the
result of the IO, as described in
[AIO Results](#aio-results).
An IO can be waited for, by both the issuing and any other backend, using
[AIO Wait References](#aio-wait-references).


Because an AIO Handle is not executable just after calling
`pgaio_io_acquire()`, and because `pgaio_io_acquire()` needs to always succeed
(absent a PANIC), only a single AIO Handle may be acquired (i.e. returned by
`pgaio_io_acquire()`) without causing the IO to be defined (by, potentially
indirectly, causing `pgaio_io_start_*()` to be called). Otherwise a backend
could trivially self-deadlock by using up all AIO Handles without the ability
to wait for some of the IOs to complete.

If it turns out that an AIO Handle is not needed, e.g., because the handle was
acquired before holding a contended lock, it can be released without being
defined using `pgaio_io_release()`.
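
A minimal sketch of that pattern follows; `SomeContendedLock` and `io_needed`
are purely illustrative placeholders, only the `pgaio_io_acquire()` /
`pgaio_io_release()` calls are part of the AIO API described here.

```C
/*
 * Hypothetical sketch: acquire the handle before taking a contended lock;
 * under the lock it may turn out that no IO is needed after all, in which
 * case the still-undefined handle is simply released.
 *
 * "SomeContendedLock" and "io_needed" are placeholders, not real postgres
 * symbols.
 */
PgAioReturn ioret;
PgAioHandle *ioh = pgaio_io_acquire(CurrentResourceOwner, &ioret);

LWLockAcquire(SomeContendedLock, LW_EXCLUSIVE);

if (io_needed)
{
    /* ... define the IO, e.g. via smgrstartreadv(), as shown above ... */
}
else
{
    /* The handle was never defined, so it can be given back directly. */
    pgaio_io_release(ioh);
}

LWLockRelease(SomeContendedLock);
```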


### AIO Callbacks

Commonly several layers need to react to the completion of an IO. E.g. for a
read, md.c needs to check if the IO outright failed or was shorter than
needed, bufmgr.c needs to verify the page looks valid, and bufmgr.c needs to
update the BufferDesc to reflect the buffer's new state.

The fact that several layers / subsystems need to react to IO completion poses
a few challenges:

- Upper layers should not need to know details of lower layers. E.g. bufmgr.c
  should not assume the IO will pass through md.c. Therefore upper levels
  cannot know what lower layers would consider an error.

- Lower layers should not need to know about upper layers. E.g. smgr APIs are
  used going through shared buffers but are also used bypassing shared
  buffers. This means that e.g. md.c is not in a position to validate
  checksums.

- Having code in the AIO subsystem for every possible combination of layers
  would lead to a lot of duplication.

The "solution" to this is the ability to associate multiple completion |
||||
callbacks with a handle. E.g. bufmgr.c can have a callback to update the |
||||
BufferDesc state and to verify the page and md.c can have another callback to |
||||
check if the IO operation was successful. |
||||
|
||||
As [mentioned](#state-for-aio-needs-to-live-in-shared-memory), shared memory |
||||
currently cannot contain function pointers. Because of that completion |
||||
callbacks are not directly identified by function pointers but by IDs |
||||
(`PgAioHandleCallbackID`). A substantial added benefit is that that |
||||
allows callbacks to be identified by much smaller amount of memory (a single |
||||
byte currently). |
||||
|
||||
In addition to completion, AIO callbacks are also called to "stage" an
IO. This is, e.g., used to increase buffer reference counts to account for the
AIO subsystem referencing the buffer, which is required to handle the case
where the issuing backend errors out and releases its own pins while the IO is
still ongoing.

As [explained earlier](#io-can-be-started-in-critical-sections), IO completions
need to be safe to execute in critical sections. To allow the backend that
issued the IO to error out in case of failure, [AIO Results](#aio-results) can
be used.


### AIO Targets

In addition to the completion callbacks described above, each AIO Handle has
exactly one "target". Each target has some space inside an AIO Handle with
information specific to the target and can provide callbacks to allow the
underlying file to be reopened (required for worker mode) and to describe the
IO operation (used for debug logging and error messages).

I.e., if two different uses of AIO can describe the identity of the file being
operated on the same way, it likely makes sense to use the same
target. E.g. different smgr implementations can describe IO with
RelFileLocator, ForkNumber and BlockNumber and can thus share a target. In
contrast, IO for a WAL file would be described with TimeLineID and XLogRecPtr,
and it would not make sense to use the same target for smgr and WAL.


### AIO Wait References

As [described above](#aio-handles), AIO Handles can be reused immediately
after completion and therefore cannot be used to wait for completion of the
IO. Waiting is enabled using AIO wait references, which do not just identify
an AIO Handle but also include the handle's "generation".

A reference to an AIO Handle can be acquired using `pgaio_io_get_wref()` and
then waited upon using `pgaio_wref_wait()`.
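
As a sketch, a wait reference might be published through shared memory so that
any backend can wait for the IO; the struct and field names below are
hypothetical, only `pgaio_io_get_wref()` and `pgaio_wref_wait()` are part of
the AIO API described here.

```C
/*
 * Hypothetical sketch: publish a wait reference in shared memory so that any
 * backend can wait for the IO. The struct and variable names are made up.
 */
typedef struct SharedIoState
{
    PgAioWaitRef io_in_progress;
} SharedIoState;

/* Issuing backend: remember how to wait for the IO it just defined. */
pgaio_io_get_wref(ioh, &shared_state->io_in_progress);

/*
 * Any backend, later: wait for that IO. Because the reference includes the
 * handle's generation, a wait on a handle that has since completed and been
 * reused for an unrelated IO does not block on that new IO.
 */
pgaio_wref_wait(&shared_state->io_in_progress);
```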


### AIO Results

As AIO completion callbacks
[are executed in critical sections](#io-can-be-started-in-critical-sections)
and [may be executed by any backend](#deadlock-and-starvation-dangers-due-to-aio),
completion callbacks cannot be used to, e.g., make the query that triggered an
IO ERROR out.

To allow reacting to failing IOs, the issuing backend can pass a pointer to a
`PgAioReturn` in backend local memory. Before an AIO Handle is reused, the
`PgAioReturn` is filled with information about the IO. This includes
information about whether the IO was successful (as a value of
`PgAioResultStatus`) and enough information to raise an error in case of a
failure (via `pgaio_result_report()`, with the error details encoded in
`PgAioResult`).


### AIO Errors

It would be very convenient to have shared completion callbacks encode the
details of errors as an `ErrorData` that could be raised at a later
time. Unfortunately doing so would require allocating memory. While elog.c can
guarantee (well, kinda) that logging a message will not run out of memory,
that only works because a very limited number of messages are in the process
of being logged. With AIO a large number of concurrently issued AIOs might
fail.

To avoid the need for preallocating a potentially large amount of memory (in
shared memory no less!), completion callbacks instead have to encode errors in
a more compact format that can be converted into an error message.


## Helpers

Using the low-level AIO API introduces too much complexity to do so all over
the tree. Most uses of AIO should be done via reusable, higher-level helpers.


### Read Stream

A common and very beneficial use of AIO is for reads where a substantial
number of to-be-read locations are known ahead of time. E.g., for a sequential
scan the set of blocks that need to be read can be determined solely by
knowing the current position and checking the buffer mapping table.

The [Read Stream](../../../include/storage/read_stream.h) interface makes it
comparatively easy to use AIO for such use cases.
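
A rough sketch of consuming a block range through a read stream is shown
below. It assumes the `read_stream_begin_relation()`,
`read_stream_next_buffer()` and `read_stream_end()` functions and the
`block_range_read_stream_cb` callback from read_stream.h; treat the exact
flags and signatures as assumptions and consult the header for the
authoritative definitions.

```C
/*
 * Sketch: read blocks [0, nblocks) of a relation via a read stream, letting
 * the stream issue reads (possibly asynchronously) ahead of consumption.
 * Based on the read_stream.h API; details may differ between versions.
 */
static void
read_whole_range(Relation rel, BlockNumber nblocks)
{
    BlockRangeReadStreamPrivate p = {0};
    ReadStream *stream;
    Buffer      buf;

    p.current_blocknum = 0;
    p.last_exclusive = nblocks;

    stream = read_stream_begin_relation(READ_STREAM_FULL,
                                        NULL,   /* no access strategy */
                                        rel,
                                        MAIN_FORKNUM,
                                        block_range_read_stream_cb,
                                        &p,
                                        0);     /* no per-buffer data */

    /* The stream reads ahead; we just consume the buffers in order. */
    while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
    {
        /* ... use the buffer ... */
        ReleaseBuffer(buf);
    }

    read_stream_end(stream);
}
```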