aio: Add README.md explaining higher level design

Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
3 months ago · fdd146a8ef
parent 5aec7e07fb
commit fdd146a8ef
2 changed files with 426 additions and 0 deletions
--- a/src/backend/storage/aio/README.md
+++ b/src/backend/storage/aio/README.md
@@ -0,0 +1,424 @@
 # Asynchronous & Direct IO
 ## Motivation
 ### Why Asynchronous IO
 Until the introduction of asynchronous IO, postgres relied on the operating
 system to hide the cost of synchronous IO from postgres. While this worked
 surprisingly well in a lot of workloads, it does not do as good a job on
 prefetching and controlled writeback as we would like.
 There are important expensive operations like `fdatasync()` where the operating
 system cannot hide the storage latency. This is particularly important for WAL
 writes, where the ability to asynchronously issue `fdatasync()` or O_DSYNC
 writes can yield significantly higher throughput.
 ### Why Direct / unbuffered IO
 The main reasons to want to use Direct IO are:
 - Lower CPU usage / higher throughput. Particularly on modern storage, buffered
  writes are bottlenecked by the operating system having to copy data between
  the kernel's page cache and postgres' buffer pool using the CPU. Direct IO,
  in contrast, can often move the data directly between the storage devices
  and postgres' buffer cache, using DMA. While that transfer is ongoing, the
  CPU is free to perform other work.
 - Reduced latency - Direct IO can have substantially lower latency than
  buffered IO, which can be impactful for OLTP workloads bottlenecked by WAL
  write latency.
 - Avoiding double buffering between operating system cache and postgres'
  shared_buffers.
 - Better control over the timing and pace of dirty data writeback.
 The main reasons *not* to use Direct IO are:
 - Without AIO, Direct IO is unusably slow for most purposes.
 - Even with AIO, many parts of postgres need to be modified to perform
  explicit prefetching.
 - In situations where shared_buffers cannot be set appropriately large,
  e.g. because there are many different postgres instances hosted on shared
  hardware, performance will often be worse than when using buffered IO.
 ## AIO Usage Example
 In many cases code that can benefit from AIO does not directly have to
 interact with the AIO interface, but can use AIO via higher-level
 abstractions. See [Helpers](#helpers).
 In this example, a buffer will be read into shared buffers.
 ```C
 /*
 * Result of the operation, only to be accessed in this backend.
 */
 PgAioReturn ioret;
 /*
 * Acquire an AIO Handle, ioret will get result upon completion.
 *
 * Note that ioret needs to stay alive until the IO completes or
 * CurrentResourceOwner is released (i.e. an error is thrown).
 */
 PgAioHandle *ioh = pgaio_io_acquire(CurrentResourceOwner, &ioret);
 /*
 * Reference that can be used to wait for the IO we initiate below. This
 * reference can reside in local or shared memory and be waited upon by any
 * process. An arbitrary number of references can be made for each IO.
 */
 PgAioWaitRef iow;
 pgaio_io_get_wref(ioh, &iow);
 /*
 * Arrange for shared buffer completion callbacks to be called upon completion
 * of the IO. This callback will update the buffer descriptors associated with
 * the AioHandle, which e.g. allows other backends to access the buffer.
 *
 * A callback can be passed a small bit of data, e.g. to indicate whether to
 * zero a buffer if it is invalid.
 *
 * Multiple completion callbacks can be registered for each handle.
 */
 pgaio_io_register_callbacks(ioh, PGAIO_HCB_SHARED_BUFFER_READV, 0);
 /*
 * The completion callback needs to know which buffers to update when the IO
 * completes. As the AIO subsystem does not know about buffers, we have to
 * associate this information with the AioHandle, for use by the completion
 * callback registered above.
 *
 * In this example we're reading only a single buffer, hence the 1.
 */
 pgaio_io_set_handle_data_32(ioh, (uint32 *) buffer, 1);
 /*
 * Pass the AIO handle to lower-level function. When operating on the level of
 * buffers, we don't know how exactly the IO is performed, that is the
 * responsibility of the storage manager implementation.
 *
 * E.g. md.c needs to translate block numbers into offsets in segments.
 *
 * Once the IO handle has been handed off to smgrstartreadv(), it may not
 * further be used, as the IO may immediately get executed below
 * smgrstartreadv() and the handle reused for another IO.
 *
 * To issue multiple IOs in an efficient way, a caller can call
 * pgaio_enter_batchmode() before starting multiple IOs, and end that batch
 * with pgaio_exit_batchmode().  Note that one needs to be careful while there
 * may be unsubmitted IOs, as another backend may need to wait for one of the
 * unsubmitted IOs. If this backend then had to wait for the other backend,
 * it'd end in an undetected deadlock. See pgaio_enter_batchmode() for more
 * details.
 *
 * Note that even while in batchmode an IO might get submitted immediately,
 * e.g. due to reaching a limit on the number of unsubmitted IOs, and even
 * complete before smgrstartreadv() returns.
 */
 smgrstartreadv(ioh, operation->smgr, forknum, blkno,
               BufferGetBlock(buffer), 1);
 /*
 * To benefit from AIO, the backend should perform other work, including
 * submitting other IOs, before waiting for the IO to complete. Otherwise
 * we could just have used synchronous, blocking IO.
 */
 perform_other_work();
 /*
 * We did some other work and now need the IO operation to have completed to
 * continue.
 */
 pgaio_wref_wait(&iow);
 /*
 * At this point the IO has completed. We do not yet know whether it succeeded
 * or failed, however. The buffer's state has been updated, which allows other
 * backends to use the buffer (if the IO succeeded), or retry the IO (if it
 * failed).
 *
 * Note that in case the IO has failed, a LOG message may have been emitted,
 * but no ERROR has been raised. This is crucial, as another backend waiting
 * for this IO should not see an ERROR.
 *
 * To check whether the operation succeeded, and to raise an ERROR, or if more
 * appropriate LOG, the PgAioReturn we passed to pgaio_io_acquire() is used.
 */
 if (ioret.result.status == PGAIO_RS_ERROR)
    pgaio_result_report(ioret.result, &ioret.target_data, ERROR);
 /*
 * Besides having succeeded completely, the IO could also have a) partially
 * completed or b) succeeded with a warning (e.g. due to zero_damaged_pages).
 * If we e.g. tried to read many blocks at once, the read might have
 * only succeeded for the first few blocks.
 *
 * If the IO partially succeeded and this backend needs all blocks to have
 * completed, this backend needs to reissue the IO for the remaining buffers.
 * The AIO subsystem cannot handle this retry transparently.
 *
 * As this example is already long, and we only read a single block, we'll just
 * error out if there's a partial read or a warning.
 */
 if (ioret.result.status != PGAIO_RS_OK)
    pgaio_result_report(ioret.result, &ioret.target_data, ERROR);
 /*
 * The IO succeeded, so we can use the buffer now.
 */
 ```
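 The closing comments of the example note that a partially completed read has
 to be reissued by the caller for the remaining data. The same pattern, shown
 here for plain synchronous IO with POSIX `pread()` rather than with the AIO
 API, might look roughly like the sketch below. `demo_read_fully()` is a
 hypothetical helper for illustration only and does not exist in postgres.

 ```C
 #define _POSIX_C_SOURCE 200809L     /* for pread() */

 #include <assert.h>
 #include <errno.h>
 #include <stddef.h>
 #include <sys/types.h>
 #include <unistd.h>

 /*
  * Hypothetical helper, not part of postgres: read exactly len bytes starting
  * at offset, reissuing the read for the remainder after each short read.
  *
  * Returns 0 on success, -1 on error or if EOF is reached too early.
  */
 int
 demo_read_fully(int fd, char *buf, size_t len, off_t offset)
 {
     size_t      done = 0;

     assert(fd >= 0 && buf != NULL);

     while (done < len)
     {
         ssize_t     nread = pread(fd, buf + done, len - done,
                                   offset + (off_t) done);

         if (nread < 0)
         {
             if (errno == EINTR)
                 continue;       /* interrupted, retry the same range */
             return -1;          /* hard error */
         }
         if (nread == 0)
             return -1;          /* EOF: retrying cannot make progress */

         done += (size_t) nread; /* partial success: reissue for the rest */
     }

     return 0;
 }
 ```

 With AIO the loop body would instead acquire a new handle and start a new IO
 for the remaining buffers, since the handle of the partially completed IO may
 already have been reused.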
 ## Design Criteria & Motivation
 ### Deadlock and Starvation Dangers due to AIO
 Using AIO in a naive way can easily lead to deadlocks in an environment where
 the source/target of AIO are shared resources, like pages in postgres'
 shared_buffers.
 Consider one backend performing readahead on a table, initiating IO for a
 number of buffers ahead of the current "scan position". If that backend then
 performs some operation that blocks, or even just is slow, the IO completion
 for the asynchronously initiated read may not be processed.
 This AIO implementation solves this problem by requiring that AIO methods
 either allow AIO completions to be processed by any backend in the system
 (e.g. io_uring), or guarantee that AIO processing will happen even when the
 issuing backend is blocked (e.g. worker mode, which offloads completion
 processing to the AIO workers).
 ### IO can be started in critical sections
 Using AIO for WAL writes can reduce the overhead of WAL logging substantially:
 - AIO allows WAL writes to be started eagerly, so that they complete before
  anything needs to wait for them
 - AIO allows multiple WAL flushes to be in progress at the same time
 - AIO makes it more realistic to use O\_DIRECT + O\_DSYNC, which can reduce
  the number of roundtrips to storage on some OSs and storage HW (buffered IO
  and direct IO without O\_DSYNC need to issue a write and after the write's
  completion a cache flush, whereas O\_DIRECT + O\_DSYNC can use a single
  Force Unit Access (FUA) write).
 The need to be able to execute IO in critical sections has substantial design
 implications for the AIO subsystem, mainly because completing IOs (see prior
 section) needs to be possible within a critical section, even if the
 to-be-completed IO itself was not issued in a critical section. Consider
 e.g. the case of a backend first starting a number of writes from shared
 buffers and then starting to flush the WAL. Because only a limited amount of
 IO can be in-progress at the same time, initiating IO for flushing the WAL may
 require first completing IO that was started earlier.
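 The interaction described above can be illustrated with a toy model. This is
 only a sketch: `DemoQueue`, `demo_submit()` and `demo_complete_oldest()` are
 hypothetical names, and the real subsystem tracks far more state. The point is
 that submitting a new IO may force completion processing for an older,
 unrelated IO, so that processing must be safe wherever submission is allowed.

 ```C
 #include <assert.h>

 #define DEMO_MAX_IN_FLIGHT 2

 /* Toy model of a backend's limited set of in-flight IOs. */
 typedef struct DemoQueue
 {
     int         in_flight[DEMO_MAX_IN_FLIGHT];  /* oldest first */
     int         count;
     int         completed;      /* number of IOs completed so far */
 } DemoQueue;

 /*
  * Complete the oldest IO. In this model that must neither fail nor allocate
  * memory, because it can be reached from within a critical section.
  */
 void
 demo_complete_oldest(DemoQueue *q)
 {
     assert(q->count > 0);

     for (int i = 1; i < q->count; i++)
         q->in_flight[i - 1] = q->in_flight[i];
     q->count--;
     q->completed++;
 }

 /* Start a new IO, first making room if the limit has been reached. */
 void
 demo_submit(DemoQueue *q, int io_id)
 {
     if (q->count == DEMO_MAX_IN_FLIGHT)
         demo_complete_oldest(q);    /* unrelated to io_id */

     q->in_flight[q->count++] = io_id;
 }
 ```

 If IOs 1 and 2 stand for buffer writes and IO 3 for a WAL flush issued inside
 a critical section, then `demo_submit(&q, 3)` has to complete IO 1 first.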
 ### State for AIO needs to live in shared memory
 Because postgres uses a process model and because AIOs need to be
 complete-able by any backend, much of the state of the AIO subsystem needs to
 live in shared memory.
 In an `EXEC_BACKEND` build, a backend's executable code and other process
 local state is not necessarily mapped to the same addresses in each process
 due to ASLR. This means that the shared memory cannot contain pointers to
 callbacks.
 ## Design of the AIO Subsystem
 ### AIO Methods
 To achieve portability and performance, multiple methods of performing AIO are
 implemented and others are likely worth adding in the future.
 #### Synchronous Mode
 `io_method=sync` does not actually perform AIO but allows the AIO API to be
 used while performing synchronous IO. This can be useful for debugging. The
 code for the synchronous mode is also used as a fallback, e.g. the [worker
 mode](#worker) uses it to execute IO that cannot be executed by workers.
 #### Worker
 `io_method=worker` is available on every platform postgres runs on, and
 implements asynchronous IO - from the view of the issuing process - by
 dispatching the IO to one of several worker processes performing the IO in a
 synchronous manner.
 #### io_uring
 `io_method=io_uring` is available on Linux 5.1+. In contrast to worker mode it
 dispatches all IO from within the process, lowering context switch rate /
 latency.
 ### AIO Handles
 The central API piece of postgres' AIO abstraction is the AIO handle. To
 execute an IO one first has to acquire an IO handle (`pgaio_io_acquire()`) and
 then "define" it, i.e. associate an IO operation with the handle.
 Often AIO handles are acquired on a higher level and then passed to a lower
 level to be fully defined. E.g., for IO to/from shared buffers, bufmgr.c
 routines acquire the handle, which is then passed through smgr.c, md.c to be
 finally fully defined in fd.c.
 The functions used at the lowest level to define the operation are
 `pgaio_io_start_*()`.
 Because acquisition of an IO handle
 [must always succeed](#io-can-be-started-in-critical-sections)
 and the number of AIO Handles
 [has to be limited](#state-for-aio-needs-to-live-in-shared-memory)
 AIO handles can be reused as soon as they have completed. Obviously code needs
 to be able to react to IO completion. State can be updated using
 [AIO Completion callbacks](#aio-callbacks)
 and the issuing backend can provide a backend local variable to receive the
 result of the IO, as described in
 [AIO Result](#aio-results).
 An IO can be waited for, by both the issuing and any other backend, using
 [AIO References](#aio-wait-references).
 Because an AIO Handle is not executable just after calling
 `pgaio_io_acquire()` and because `pgaio_io_acquire()` needs to always succeed
 (absent a PANIC), only a single AIO Handle may be acquired (i.e. returned by
 `pgaio_io_acquire()`) without causing the IO to have been defined (by,
 potentially indirectly, causing `pgaio_io_start_*()` to have been
 called). Otherwise a backend could trivially self-deadlock by using up all AIO
 Handles without the ability to wait for some of the IOs to complete.
 If it turns out that an AIO Handle is not needed, e.g., because the handle was
 acquired before holding a contended lock, it can be released without being
 defined using `pgaio_io_release()`.
 ### AIO Callbacks
 Commonly several layers need to react to completion of an IO. E.g. for a read
 md.c needs to check if the IO outright failed or was shorter than needed,
 bufmgr.c needs to verify the page looks valid and bufmgr.c needs to update the
 BufferDesc to update the buffer's state.
 The fact that several layers / subsystems need to react to IO completion poses
 a few challenges:
 - Upper layers should not need to know details of lower layers. E.g. bufmgr.c
  should not assume the IO will pass through md.c.  Therefore upper levels
  cannot know what lower layers would consider an error.
 - Lower layers should not need to know about upper layers. E.g. smgr APIs are
  used going through shared buffers but are also used bypassing shared
  buffers. This means that e.g. md.c is not in a position to validate
  checksums.
 - Having code in the AIO subsystem for every possible combination of layers
  would lead to a lot of duplication.
 The "solution" to this is the ability to associate multiple completion
 callbacks with a handle. E.g. bufmgr.c can have a callback to update the
 BufferDesc state and to verify the page and md.c can have another callback to
 check if the IO operation was successful.
 As [mentioned](#state-for-aio-needs-to-live-in-shared-memory), shared memory
 currently cannot contain function pointers. Because of that completion
 callbacks are not directly identified by function pointers but by IDs
 (`PgAioHandleCallbackID`).  A substantial added benefit is that this
 allows callbacks to be identified by a much smaller amount of memory (a single
 byte currently).
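 A minimal sketch of ID-based callback dispatch follows. All names
 (`DemoCallbackID`, `DemoHandle`, `demo_register()`, `demo_complete()`) are
 hypothetical and the layout is not the one used by postgres. The handle, which
 stands in for shared memory state, stores only one-byte IDs, and every process
 resolves them through its own constant table. The sketch runs callbacks in
 reverse registration order so that the layer registering last is consulted
 first. That ordering is a choice made for this example.

 ```C
 #include <assert.h>
 #include <stddef.h>
 #include <stdint.h>

 #define DEMO_MAX_CALLBACKS 4

 /* Hypothetical callback IDs, small enough to fit into a single byte. */
 typedef enum DemoCallbackID
 {
     DEMO_CB_INVALID = 0,
     DEMO_CB_CHECK_LENGTH,
     DEMO_CB_MARK_VALID,
     DEMO_CB_COUNT               /* number of IDs, not a callback */
 } DemoCallbackID;

 /* Stand-in for shared memory state: IDs only, no function pointers. */
 typedef struct DemoHandle
 {
     uint8_t     callbacks[DEMO_MAX_CALLBACKS];
     int         num_callbacks;
     int         result;         /* bytes transferred, negative on failure */
     int         buffer_valid;
 } DemoHandle;

 typedef void (*DemoCallback) (DemoHandle *h);

 /* Lower layer: treat anything shorter than one 8192 byte page as failure. */
 void
 demo_check_length(DemoHandle *h)
 {
     if (h->result < 8192)
         h->result = -1;
 }

 /* Upper layer: mark the buffer valid only if the IO succeeded. */
 void
 demo_mark_valid(DemoHandle *h)
 {
     h->buffer_valid = (h->result >= 0);
 }

 /* Process-local table: each process maps IDs to its own addresses. */
 const DemoCallback demo_callbacks[DEMO_CB_COUNT] = {
     [DEMO_CB_CHECK_LENGTH] = demo_check_length,
     [DEMO_CB_MARK_VALID] = demo_mark_valid,
 };

 void
 demo_register(DemoHandle *h, DemoCallbackID id)
 {
     assert(id > DEMO_CB_INVALID && id < DEMO_CB_COUNT);
     assert(h->num_callbacks < DEMO_MAX_CALLBACKS);
     h->callbacks[h->num_callbacks++] = (uint8_t) id;
 }

 /* Can be run by any process that has the handle mapped. */
 void
 demo_complete(DemoHandle *h)
 {
     for (int i = h->num_callbacks; i > 0; i--)
         demo_callbacks[h->callbacks[i - 1]] (h);
 }
 ```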
 In addition to completion, AIO callbacks also are called to "stage" an
 IO. This is, e.g., used to increase buffer reference counts to account for the
 AIO subsystem referencing the buffer, which is required to handle the case
 where the issuing backend errors out and releases its own pins while the IO is
 still ongoing.
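 The reference counting argument can be sketched with a toy buffer. The names
 below (`DemoBuffer`, `demo_stage()` and so on) are hypothetical and the model
 ignores locking entirely. It only shows why the extra reference taken at
 staging time keeps the buffer from being recycled after the issuing backend
 has errored out.

 ```C
 #include <assert.h>
 #include <stdbool.h>

 /* Toy buffer: a reference count and an "IO in progress" flag. */
 typedef struct DemoBuffer
 {
     int         refcount;
     bool        io_in_progress;
 } DemoBuffer;

 /* Stage callback: the AIO subsystem takes its own reference. */
 void
 demo_stage(DemoBuffer *buf)
 {
     buf->refcount++;
     buf->io_in_progress = true;
 }

 /* Error cleanup in the issuing backend: only its own pin is dropped. */
 void
 demo_backend_error_cleanup(DemoBuffer *buf)
 {
     assert(buf->refcount > 0);
     buf->refcount--;
 }

 /* Completion callback: the AIO subsystem drops its reference. */
 void
 demo_complete(DemoBuffer *buf)
 {
     assert(buf->io_in_progress && buf->refcount > 0);
     buf->io_in_progress = false;
     buf->refcount--;
 }

 /* A buffer may only be recycled once nothing references it. */
 bool
 demo_can_evict(const DemoBuffer *buf)
 {
     return buf->refcount == 0;
 }
 ```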
 As [explained earlier](#io-can-be-started-in-critical-sections) IO completions
 need to be safe to execute in critical sections. To allow the backend that
 issued the IO to error out in case of failure [AIO Result](#aio-results) can
 be used.
 ### AIO Targets
 In addition to the completion callbacks described above, each AIO Handle has
 exactly one "target". Each target has some space inside an AIO Handle with
 information specific to the target and can provide callbacks to reopen the
 underlying file (required for worker mode) and to describe the IO operation
 (used for debug logging and error messages).
 I.e., if two different uses of AIO can describe the identity of the file being
 operated on the same way, it likely makes sense to use the same
 target. E.g. different smgr implementations can describe IO with
 RelFileLocator, ForkNumber and BlockNumber and can thus share a target. In
 contrast, IO for a WAL file would be described with TimeLineID and XLogRecPtr
 and it would not make sense to use the same target for smgr and WAL.
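 As a rough illustration, a target can be modelled as an ID plus a small union
 of target-specific data, with a per-target table providing a "describe"
 callback. Everything below is a hypothetical sketch: the type names, field
 names, field widths and message formats are assumptions made for this example
 and do not reproduce the definitions in the postgres source.

 ```C
 #include <assert.h>
 #include <stddef.h>
 #include <stdint.h>
 #include <stdio.h>

 typedef enum DemoTargetID
 {
     DEMO_TID_SMGR,
     DEMO_TID_WAL,
     DEMO_TID_COUNT
 } DemoTargetID;

 /* Target-specific data stored inside the handle; exactly one member is used. */
 typedef union DemoTargetData
 {
     struct
     {
         uint32_t    relnumber;
         int32_t     forknum;
         uint32_t    blockno;
     }           smgr;
     struct
     {
         uint32_t    timeline;
         uint64_t    lsn;
     }           wal;
 } DemoTargetData;

 typedef struct DemoTargetInfo
 {
     const char *name;
     int         (*describe) (const DemoTargetData *td, char *buf, size_t len);
 } DemoTargetInfo;

 int
 demo_describe_smgr(const DemoTargetData *td, char *buf, size_t len)
 {
     return snprintf(buf, len, "block %u of relation %u, fork %d",
                     (unsigned) td->smgr.blockno,
                     (unsigned) td->smgr.relnumber,
                     (int) td->smgr.forknum);
 }

 int
 demo_describe_wal(const DemoTargetData *td, char *buf, size_t len)
 {
     return snprintf(buf, len, "timeline %u, position %llu",
                     (unsigned) td->wal.timeline,
                     (unsigned long long) td->wal.lsn);
 }

 /* One entry per target, shared by all users of that target. */
 const DemoTargetInfo demo_targets[DEMO_TID_COUNT] = {
     [DEMO_TID_SMGR] = {"smgr", demo_describe_smgr},
     [DEMO_TID_WAL] = {"wal", demo_describe_wal},
 };

 int
 demo_describe(DemoTargetID tid, const DemoTargetData *td,
               char *buf, size_t len)
 {
     assert(tid < DEMO_TID_COUNT);
     return demo_targets[tid].describe(td, buf, len);
 }
 ```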
 ### AIO Wait References
 As [described above](#aio-handles), AIO Handles can be reused immediately
 after completion and therefore cannot be used to wait for completion of the
 IO. Waiting is enabled using AIO wait references, which do not just identify
 an AIO Handle but also include the handle's "generation".
 A reference to an AIO Handle can be acquired using `pgaio_io_get_wref()` and
 then waited upon using `pgaio_wref_wait()`.
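 Why a generation is needed can be seen in a small model. The sketch below uses
 hypothetical names (`DemoHandle`, `DemoWaitRef`, `demo_wref_is_done()`) and
 assumes, for simplicity, that the generation is advanced at the moment an IO
 completes. A reference taken for an earlier IO then reads as "done" even after
 the same handle slot has been reused for a later IO.

 ```C
 #include <assert.h>
 #include <stdbool.h>
 #include <stdint.h>

 #define DEMO_NUM_HANDLES 4

 typedef struct DemoHandle
 {
     uint64_t    generation;     /* advanced each time an IO completes */
     bool        in_flight;
 } DemoHandle;

 /* Plain data, so it could be copied into local or shared memory. */
 typedef struct DemoWaitRef
 {
     uint32_t    index;
     uint64_t    generation;
 } DemoWaitRef;

 DemoHandle  demo_handles[DEMO_NUM_HANDLES];

 /* Start an IO on the given handle and return a reference to that IO. */
 DemoWaitRef
 demo_start_io(uint32_t index)
 {
     DemoHandle *h;

     assert(index < DEMO_NUM_HANDLES);
     h = &demo_handles[index];
     assert(!h->in_flight);
     h->in_flight = true;

     return (DemoWaitRef) {.index = index, .generation = h->generation};
 }

 /* Complete the IO; the handle becomes reusable and old references stale. */
 void
 demo_complete_io(uint32_t index)
 {
     DemoHandle *h = &demo_handles[index];

     assert(h->in_flight);
     h->in_flight = false;
     h->generation++;
 }

 /* True once the IO the reference was created for has completed. */
 bool
 demo_wref_is_done(const DemoWaitRef *ref)
 {
     return demo_handles[ref->index].generation != ref->generation;
 }
 ```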
 ### AIO Results
 As AIO completion callbacks
 [are executed in critical sections](#io-can-be-started-in-critical-sections)
 and [may be executed by any backend](#deadlock-and-starvation-dangers-due-to-aio)
 completion callbacks cannot be used to, e.g., make the query that triggered an
 IO ERROR out.
 To allow the issuing backend to react to failing IOs, it can pass a pointer to a
 `PgAioReturn` in backend local memory. Before an AIO Handle is reused the
 `PgAioReturn` is filled with information about the IO. This includes
 information about whether the IO was successful (as a value of
 `PgAioResultStatus`) and enough information to raise an error in case of a
 failure (via `pgaio_result_report()`, with the error details encoded in
 `PgAioResult`).
 ### AIO Errors
 It would be very convenient to have shared completion callbacks encode the
 details of errors as an `ErrorData` that could be raised at a later
 time. Unfortunately doing so would require allocating memory. While elog.c can
 guarantee (well, kinda) that logging a message will not run out of memory,
 that only works because a very limited number of messages are in the process
 of being logged.  With AIO a large number of concurrently issued AIOs might
 fail.
 To avoid the need for preallocating a potentially large amount of memory (in
 shared memory no less!), completion callbacks instead have to encode errors in
 a more compact format that can be converted into an error message.
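 The split between encoding an error at completion time and formatting it
 later can be sketched as follows. The struct layout, the names
 (`DemoResult`, `demo_encode()`, `demo_format()`) and the message texts are
 assumptions made for this example. The sketch only demonstrates that a small,
 pointer-free value is enough to reconstruct a message once it is safe to do
 so.

 ```C
 #include <assert.h>
 #include <stddef.h>
 #include <stdint.h>
 #include <stdio.h>

 typedef enum DemoStatus
 {
     DEMO_RS_OK,
     DEMO_RS_PARTIAL,
     DEMO_RS_ERROR
 } DemoStatus;

 typedef enum DemoErrorKind
 {
     DEMO_ERR_NONE,
     DEMO_ERR_SHORT_READ,
     DEMO_ERR_OS
 } DemoErrorKind;

 /* Fixed size and pointer free, so it is cheap to preallocate per handle. */
 typedef struct DemoResult
 {
     uint8_t     status;         /* a DemoStatus */
     uint8_t     kind;           /* a DemoErrorKind */
     int32_t     detail;         /* errno, or number of bytes transferred */
 } DemoResult;

 /*
  * Runs at completion time: no allocation, just fills in the struct.
  * raw is the number of bytes read, or a negated errno on failure.
  */
 DemoResult
 demo_encode(int32_t wanted, int32_t raw)
 {
     DemoResult  r = {DEMO_RS_OK, DEMO_ERR_NONE, raw};

     if (raw < 0)
     {
         r.status = DEMO_RS_ERROR;
         r.kind = DEMO_ERR_OS;
         r.detail = -raw;
     }
     else if (raw < wanted)
     {
         r.status = DEMO_RS_PARTIAL;
         r.kind = DEMO_ERR_SHORT_READ;
     }
     return r;
 }

 /* Runs later in the issuing backend, where building a message is safe. */
 int
 demo_format(DemoResult r, char *buf, size_t len)
 {
     assert(buf != NULL && len > 0);

     switch (r.kind)
     {
         case DEMO_ERR_SHORT_READ:
             return snprintf(buf, len, "short read: only %d bytes",
                             (int) r.detail);
         case DEMO_ERR_OS:
             return snprintf(buf, len, "IO failed with errno %d",
                             (int) r.detail);
         default:
             return snprintf(buf, len, "no error");
     }
 }
 ```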
 ## Helpers
 Using the low-level AIO API directly introduces too much complexity to do so
 all over the tree. Most uses of AIO should go through reusable, higher-level
 helpers.
 ### Read Stream
 A common and very beneficial use of AIO is for reads where a substantial number
 of to-be-read locations are known ahead of time. E.g., for a sequential scan
 the set of blocks that need to be read can be determined solely by knowing the
 current position and checking the buffer mapping table.
 The [Read Stream](../../../include/storage/read_stream.h) interface makes it
 comparatively easy to use AIO for such use cases.
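 The general idea of a callback-driven stream with lookahead can be modelled in
 a few lines. The sketch below does not reproduce the interface declared in
 `read_stream.h`. `DemoStream`, `demo_stream_next()` and the fixed lookahead of
 three blocks are assumptions made for this example, and the place where a real
 implementation would start or wait for an asynchronous read is only marked by
 a comment.

 ```C
 #include <assert.h>
 #include <stdbool.h>
 #include <stddef.h>
 #include <stdint.h>

 #define DEMO_INVALID_BLOCK UINT32_MAX
 #define DEMO_LOOKAHEAD 3

 /* Caller-supplied callback: returns the next block, or DEMO_INVALID_BLOCK. */
 typedef uint32_t (*DemoNextBlockCB) (void *private_data);

 typedef struct DemoStream
 {
     DemoNextBlockCB next_block;
     void       *private_data;
     uint32_t    queue[DEMO_LOOKAHEAD];  /* ring of blocks with IO started */
     int         head;
     int         count;
     int         ios_started;
     bool        exhausted;
 } DemoStream;

 /* Start IO for as many upcoming blocks as the lookahead distance allows. */
 void
 demo_stream_fill(DemoStream *s)
 {
     assert(s->next_block != NULL);

     while (s->count < DEMO_LOOKAHEAD && !s->exhausted)
     {
         uint32_t    blkno = s->next_block(s->private_data);

         if (blkno == DEMO_INVALID_BLOCK)
         {
             s->exhausted = true;
             break;
         }

         /* A real implementation would start an asynchronous read here. */
         s->queue[(s->head + s->count) % DEMO_LOOKAHEAD] = blkno;
         s->count++;
         s->ios_started++;
     }
 }

 /* Return the next block in order, or DEMO_INVALID_BLOCK at the end. */
 uint32_t
 demo_stream_next(DemoStream *s)
 {
     uint32_t    blkno;

     demo_stream_fill(s);
     if (s->count == 0)
         return DEMO_INVALID_BLOCK;

     /* A real implementation would wait for the oldest read here. */
     blkno = s->queue[s->head];
     s->head = (s->head + 1) % DEMO_LOOKAHEAD;
     s->count--;
     return blkno;
 }

 /* Example callback: a sequential scan over the blocks [next, end). */
 typedef struct DemoSeqScan
 {
     uint32_t    next;
     uint32_t    end;
 } DemoSeqScan;

 uint32_t
 demo_seqscan_next(void *private_data)
 {
     DemoSeqScan *scan = private_data;

     return scan->next < scan->end ? scan->next++ : DEMO_INVALID_BLOCK;
 }
 ```

 The consumer only ever asks for the next block. How far ahead IO is started is
 decided inside the helper, which is what keeps the calling code simple.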
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -24,6 +24,8 @@
 *
 * - read_stream.c - helper for reading buffered relation data
 *
 * - README.md - higher-level overview over AIO
 *
 *
 * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California