This is an attempt to document the cyrus mailbox format. It should not be considered authoritative and is subject to change.
No external tools should make use of this information. The only supported method of access to the mail store is through the standard interfaces: IMAP, POP, NNTP, LMTP, etc.
A cyrus mailbox is a directory in the filesystem. It contains the following files:
The message files are named by their UID, followed by a ".", so UID 423 would be named "423.". They are stored in wire-format: lines are terminated by CRLF and binary data is not allowed.
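As a small illustration of the naming rule and the wire-format constraint, here is a minimal sketch. The function names are hypothetical, and treating a NUL byte as the marker for "binary data" is an assumption, not something this document specifies.

```python
def message_filename(uid: int) -> str:
    """Return the file name for a message UID, e.g. 423 -> "423."."""
    if uid <= 0:
        raise ValueError("UIDs are positive integers")
    return f"{uid}."


def looks_like_wire_format(data: bytes) -> bool:
    """Rough check: every LF belongs to a CRLF pair and no NUL bytes appear.

    The NUL test is only an assumed stand-in for "binary data is not allowed".
    """
    if b"\x00" in data:
        return False
    # After removing every CRLF pair, no bare LF may remain.
    return b"\n" not in data.replace(b"\r\n", b"")
```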
This file contains mailbox-wide information that does not change that often. Its format:
  <Mailbox Header Magic String>
  <Quota Root>\t<Mailbox Unique ID String>\n
  <Space-separated list of user flags>\n
  <Mailbox ACL>\n
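A reader for this layout could be sketched roughly as follows. The magic string is treated as an opaque value supplied by the caller, since its contents are not given here. The names `MailboxHeader` and `parse_header`, the UTF-8 decoding, and keeping the ACL as an uninterpreted string are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class MailboxHeader(NamedTuple):
    quota_root: str
    unique_id: str
    user_flags: List[str]
    acl: str  # kept as the raw line; its internal format is not parsed here


def parse_header(data: bytes, magic: bytes) -> MailboxHeader:
    """Parse the three lines that follow the magic string."""
    if not data.startswith(magic):
        raise ValueError("missing mailbox header magic string")
    # Text encoding is an assumption; the format description does not state one.
    lines = data[len(magic):].decode("utf-8").split("\n")
    if len(lines) < 4:
        raise ValueError("truncated mailbox header")
    quota_root, tab, unique_id = lines[0].partition("\t")
    if not tab:
        raise ValueError("quota root line lacks a tab separator")
    return MailboxHeader(quota_root, unique_id, lines[1].split(), lines[2])
```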
XXX: these files are not just caches; the index file also stores information that is not present in the message file.
These files cache frequently accessed information on a per-message basis. The index file holds fixed-length records on a per-message basis (and a header for the mailbox of related metadata), while the cache file holds variable-length information.
Any binary data in these files is stored in network byte order. All of the binary data is also 4-byte aligned. Strings in the cyrus.cache are stored NUL-terminated (this only applies to cyrus.cache). To ensure alignment of following data, the end of strings may be NUL-padded by up to 4 bytes.
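The byte-order and alignment rules can be sketched with Python's `struct` module. The helper names are hypothetical, and the exact padding rule (one terminating NUL, then NULs up to the next 4-byte boundary) is an assumed reading of the paragraph above.

```python
import struct


def pack_bit32(value: int) -> bytes:
    """Encode an unsigned 32-bit integer in network (big-endian) byte order."""
    return struct.pack(">I", value)


def pad_string(s: bytes) -> bytes:
    """NUL-terminate s, then NUL-pad to the next 4-byte boundary (assumed rule)."""
    out = s + b"\x00"
    return out + b"\x00" * (-len(out) % 4)
```

Under this reading, a string always gains between one and four NUL bytes in total.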
The cyrus.expunge file has exactly the same format as cyrus.index, and holds the records of expunged messages whose corresponding cache records and message files have not yet been deleted.
The overall format of these files looks sort of like this:
cyrus.index:
+----------------+
| Mailbox Header |
+----------------+
| Msg: Seq Num 1 |
+----------------+
| Msg: Seq Num 2 |
+----------------+
|      ...       |
+----------------+
The basic idea is that there is one header, followed by message records of equal size. Every message record therefore sits at a well-known offset, making any part of the file accessible at roughly equal speed.
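Because the records are fixed-size, locating one is simple arithmetic. The sketch below takes the header and record sizes as parameters rather than assuming real values; the function name is hypothetical.

```python
def record_offset(seq_num: int, header_size: int, record_size: int) -> int:
    """Byte offset in cyrus.index of the record for a 1-based sequence number."""
    if seq_num < 1:
        raise ValueError("sequence numbers start at 1")
    return header_size + (seq_num - 1) * record_size


# Placeholder sizes for illustration only; they are not the real on-disk sizes.
offset_of_third = record_offset(3, header_size=96, record_size=60)
```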
cyrus.cache:
+------------------------------------------------------------------------+
|Gen # (32bits)|Size 1 (32bits)|Data 1                                   |
+------------------------------------------------------------------------+
|          |Size 2 (32bits)|Data 2        |Size 3 (32bits)| Data 3       |
+------------------------------------------------------------------------+
|                                 .....                                  |
+------------------------------------------------------------------------+
The cache file is different from the index file. It starts with a 4-byte header (the generation number; more on that later), followed by a series of entries in (size)(data) format. The entries for each message are always consecutive and always in the same order (i.e. for any given message, the envelope is always the first piece of data), but without an offset from the index file there is no way to tell where a given message's entries start.
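Walking the (size)(data) entries could look roughly like this. The starting offset is expected to come from the index file. Two points are assumptions rather than statements from this document: that the size counts only the unpadded data, and that the reader must skip to the next 4-byte boundary after each entry. The function names are hypothetical.

```python
import struct
from typing import List, Tuple


def cache_generation(buf: bytes) -> int:
    """Read the 4-byte generation number at the start of the cache file."""
    return struct.unpack_from(">I", buf, 0)[0]


def read_cache_fields(buf: bytes, offset: int, count: int) -> Tuple[List[bytes], int]:
    """Read `count` consecutive (size)(data) entries starting at `offset`.

    Returns the data of each entry and the offset just past the last one.
    """
    fields = []
    for _ in range(count):
        (size,) = struct.unpack_from(">I", buf, offset)
        offset += 4
        if offset + size > len(buf):
            raise ValueError("cache entry runs past end of buffer")
        fields.append(buf[offset:offset + size])
        offset += size
        offset += -offset % 4  # assumed: entries are padded to 4-byte alignment
    return fields, offset
```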
The index header contains the following information, in order:
There are also spare fields in the index header, to allow for future expansion without forcing an upgrade of the file.
These records start immediately following the cyrus.index header, and are all fixed size. They are in-order by sequence number of the message.
The order of fields per record in the cache file is as follows (keep in mind that each is preceded by a 4-byte size in network byte order):
Offsets into the message file to pull out various body parts. Because of the nature of MIME parts, this is somewhat recursive.
This looks like the following (starting the octet following the cache field size). All of the fields are bit32s.
  [
   [Number of message parts+1 for the rfc822 header if present]
   [
    [Offset in the message file of the header of this part]
    [Size (octets) of the header of this part]
    [Offset in the message file of the content of this part]
    [Size (octets) of the content of this part]
    [Encoding Type of this part]
   ]
      (repeat for each part as well as once for the headers)
   [zero *or* number of sub-parts in the case of a multipart.
    if nonzero, this is a recursion into the top structure]
      (repeat for each part)
  ] 
Note that if this is not a message/rfc822, then the sizes for part 0 are -1 (to indicate that it doesn't exist). Sub-parts are not possible for a part 0, so they aren't included when finding recursive entries.
The offset and size info for both the mime header and content part are useful in order to do fast indexing on the appropriate parts of the message file when a client does a FETCH request for BODY[HEADER], or BODY[2.MIME].
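Given an (offset, size) pair from the cache, serving such a FETCH reduces to slicing the message file. This is a minimal sketch with a hypothetical function name. It treats a size of -1 (read back as 0xFFFFFFFF from an unsigned bit32) as "part does not exist", and the offsets in the usage lines are made up for the sample message.

```python
MISSING = 0xFFFFFFFF  # -1 stored in an unsigned 32-bit field


def slice_part(message: bytes, offset: int, size: int) -> bytes:
    """Return the bytes of one header or content region of a message file."""
    if size == MISSING:
        return b""  # region does not exist (e.g. part 0 of a non-message/rfc822)
    if offset + size > len(message):
        raise ValueError("region extends past end of message")
    return message[offset:offset + size]


# Made-up offsets for a tiny sample message.
sample = b"Subject: hi\r\n\r\nbody\r\n"
header_bytes = slice_part(sample, 0, 15)
content_bytes = slice_part(sample, 15, 6)
```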
Note that the top-level RFC822 headers are treated as a separate part from their body text ("0" or "HEADER").
In the case of a multipart/alternative, the content size & offset refers to the size of the entire mime part.
A very simple message (with a single text/plain part) would therefore look like:
[[2][rfc822 header][text/plain body part info][0]]
A simple multipart/alternative message might look like:
  [[3][rfc822 header][text/plain message part info]
      [second message part info][0][0]]
A message with an attachment that has two subparts:
  [[3][rfc822 header info][rfc822 first body part info][attachment info][0][
       [3][NIL header info][sub part 1 info][sub part 2 info][0][0]]]
A message with an attached message/rfc822 message with the following total structure:
    message/rfc822
      0 headers; content-type: multipart/mixed
      1 text/plain
      2 message/rfc822
        0 headers; content-type: multipart/alternative
        1 text/plain
        2 text/html
  [[3][rfc822 header part 0][text/plain part 1][overall attachment info][0][
       [3][rfc822 header part 2.0][text/plain part 2.1][text/html part 2.2]
          [0][0]]]
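The recursive layout described above can be decoded with a short recursive reader. This is only a sketch based on the structure and examples in this document: the names `Section` and `parse_section` are hypothetical, and all fields are read as unsigned big-endian 32-bit integers, so a size of -1 reads back as 0xFFFFFFFF.

```python
import struct
from typing import List, Optional, Tuple

# header offset, header size, content offset, content size, encoding type
PartInfo = Tuple[int, int, int, int, int]


class Section:
    def __init__(self, parts: List[PartInfo], children: List[Optional["Section"]]):
        self.parts = parts        # parts[0] describes the headers ("part 0")
        self.children = children  # one entry per part after part 0; None = no sub-parts


def parse_section(buf: bytes, pos: int = 0) -> Tuple[Section, int]:
    """Parse one section at `pos`; return it and the offset just past it."""
    (count,) = struct.unpack_from(">I", buf, pos)
    pos += 4
    parts = []
    for _ in range(count):
        parts.append(struct.unpack_from(">5I", buf, pos))
        pos += 20
    children: List[Optional[Section]] = []
    for _ in range(count - 1):  # part 0 never has sub-parts
        (marker,) = struct.unpack_from(">I", buf, pos)
        if marker == 0:
            children.append(None)
            pos += 4
        else:
            # A nonzero marker is the count field of the nested section itself.
            child, pos = parse_section(buf, pos)
            children.append(child)
    return Section(parts, children), pos
```

Applied to the last example, the top-level section would have three parts, no sub-section for part 1, and a nested three-part section for part 2.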
Any cached header fields. These are in the same format they would appear in the message file:
HeaderName: headerdata\r\n
Examples include: References, In-Reply-To, etc.
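Since the cached headers keep their on-the-wire form, splitting them into name/value pairs is straightforward. The sketch below uses a hypothetical function name. It assumes ASCII-compatible content and unfolds continuation lines (lines starting with a space or tab) into a single value, which is a choice made here for illustration.

```python
from typing import List, Tuple


def parse_cached_headers(data: bytes) -> List[Tuple[str, str]]:
    """Split "Name: value\\r\\n" lines into (name, value) pairs."""
    headers: List[Tuple[str, str]] = []
    for line in data.decode("ascii", errors="replace").split("\r\n"):
        if not line:
            continue
        if line[0] in " \t" and headers:
            # Continuation of the previous header: unfold onto one line.
            name, value = headers[-1]
            headers[-1] = (name, value + " " + line.strip())
            continue
        name, colon, value = line.partition(":")
        if not colon:
            raise ValueError(f"not a header line: {line!r}")
        headers.append((name, value.strip()))
    return headers
```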
The message isn't delivered until the new index header is written. In case of a crash before the new index header is written, any previous writes will be overwritten on the next delivery (and will not be noticed by the readers).
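The commit rule can be illustrated with a toy in-memory model. This is not the actual delivery code: the class, the `exists` counter standing in for the message count in the index header, and the `crash_before_header` switch are all hypothetical, chosen only to show why readers never see a half-finished delivery.

```python
class ToyIndex:
    def __init__(self) -> None:
        self.exists = 0    # message count as recorded in the (toy) index header
        self.records = []  # record slots; may extend past `exists` after a crash

    def deliver(self, record: str, crash_before_header: bool = False) -> None:
        # Write the record into the first slot past the committed ones,
        # overwriting anything left behind by an interrupted delivery.
        self.records[self.exists:] = [record]
        if crash_before_header:
            return  # simulated crash: the header was never rewritten
        self.exists += 1  # rewriting the header is the commit point

    def visible(self) -> list:
        # Readers trust only the count in the header.
        return self.records[:self.exists]


idx = ToyIndex()
idx.deliver("msg-a")
idx.deliver("msg-b", crash_before_header=True)  # never becomes visible
idx.deliver("msg-c")                            # overwrites the leftover slot
```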
Note that certain power failure situations (power failure in the middle of a disk sector write) could cause a mailbox to need reconstruction (possibly even losing some flag state). These failure modes are not possible in the "Hardware RAID disk model" (which we will describe somewhere else when we get around to it).