4TB large objects (standard 8KB BLCKSZ case). For this purpose new
libpq API lo_lseek64, lo_tell64 and lo_truncate64 are added. Also
corresponding new backend functions lo_lseek64, lo_tell64 and
lo_truncate64 are added. inv_api.c is changed to handle 64-bit
offsets.
Patch contributed by Nozomi Anzai (backend side) and Yugo Nagata
(frontend side, docs, regression tests and example program). Reviewed
by Kohei Kaigai. Committed by Tatsuo Ishii with minor editings.
snapmgmt.c file for the former. The header files have also been reorganized
in three parts: the most basic snapshot definitions are now in a new file
snapshot.h, and the also new snapmgmt.h keeps the definitions for snapmgmt.c.
tqual.h has been reduced to the bare minimum.
This patch is just a first step towards managing live snapshots within a
transaction; there is no functionality change.
Per my proposal to pgsql-patches on 20080318191940.GB27458@alvh.no-ip.org and
subsequent discussion.
set to the large object context ("fscxt"), as this is inevitably a source of
transaction-duration memory leaks. Not sure why we'd not noticed it before;
maybe people weren't touching a whole lot of LOs in the same transaction
before the 8.1 pg_dump changes. Per report from Wayne Conrad.
Backpatched as far as 8.1, but the problem doubtless goes all the way back.
I'm disinclined to spend the time to try to verify that the older branches
would still work if patched, seeing that this code was significantly modified
for 8.0 and again for 8.1, and that we don't have any trouble reports before
8.1. (Maybe the leaks were smaller before?)
a descriptor that uses the current transaction snapshot, rather than
SnapshotNow as it did before (and still does if INV_WRITE is set).
This means pg_dump will now dump a consistent snapshot of large object
contents, as it never could do before. Also, add a lo_create() function
that is similar to lo_creat() but allows the desired OID of the large
object to be specified. This will simplify pg_restore considerably
(but I'll fix that in a separate commit).
as per recent discussions. Invent SubTransactionIds that are managed like
CommandIds (ie, counter is reset at start of each top transaction), and
use these instead of TransactionIds to keep track of subtransaction status
in those modules that need it. This means that a subtransaction does not
need an XID unless it actually inserts/modifies rows in the database.
Accordingly, don't assign it an XID nor take a lock on the XID until it
tries to do that. This saves a lot of overhead for subtransactions that
are only used for error recovery (eg plpgsql exceptions). Also, arrange
to release a subtransaction's XID lock as soon as the subtransaction
exits, in both the commit and abort cases. This avoids holding many
unique locks after a long series of subtransactions. The price is some
additional overhead in XactLockTableWait, but that seems acceptable.
Finally, restructure the state machine in xact.c to have a more orthogonal
set of states for subtransactions.
password/group files. Also allow read-only subtransactions of a read-write
parent, but not vice versa. These are the reasonably noncontroversial
parts of Alvaro's recent mop-up patch, plus further work on large objects
to minimize use of the TopTransactionResourceOwner.
as full as possible, seems better to use a tuple size around BLCKSZ/4
so that less space is wasted when a LO tuple is updated. Also, this
lets us use a logical page size that's an exact power of two, avoiding
partial-page writes when client is sending us stuff in power-of-2
buffer chunks.
kibitzing from Tom Lane. Large objects are now all stored in a single
system relation "pg_largeobject" --- no more xinv or xinx files, no more
relkind 'l'. This should offer substantial performance improvement for
large numbers of LOs, since there won't be directory bloat anymore.
It'll also fix problems like running out of locktable space when you
access thousands of LOs in one transaction.
Also clean up cruft in read/write routines. LOs with "holes" in them
(never-written byte ranges) now work just like Unix files with holes do:
a hole reads as zeroes but doesn't occupy storage space.
INITDB forced!
> Regression tests opr_sanity and sanity_check are now failing.
Um, Bruce, I've said several times that I didn't think Perchine's large
object changes should be applied until someone had actually reviewed
them.
I tested it restoring my database with > 100000 BLOBS, and dumping it out.
But unfortunatly I can not restore it back due to problems in pg_dump.
--
Sincerely Yours,
Denis Perchine
> this is patch v 0.4 to support transactions with BLOBs.
> All BLOBs are in one table. You need to make initdb.
>
> --
> Sincerely Yours,
> Denis Perchine
calls. Outside a transaction, the backend detects them as buffer
leaks; it sends a NOTICE, and frees them. This sometimes cause a
segmentation fault (at least on Linux). These indexes are initialized
on the first lo_read/lo_write/lo_tell call, and (normally) closed
on a lo_close call. Thus the buffer leaks appear when lo direct
access functions are used, and not with lo_import/lo_export functions
(libpq version calls lo_close before ending the command, and the
backend version uses another path).
The included patches (against recent snapshot, and against 6.3.2)
cause indexes to be closed on transaction end (that is on explicit
'END' statment, or on command termination outside trasaction blocks),
thus preventing the buffer leaks while increasing performance inside
transactions. Some (all?) 'classic' memory leaks are also removed.
I hope it will be ok.
--- Pascal ANDRE, graduated from Ecole Centrale Paris andre@via.ecp.fr