POI-HPBF - A Guide to the Publisher File Format

Overview

Document Streams

The file is made up of a number of POIFS streams. A typical file is laid out as follows:

Root Entry -
  Objects -
    (no children)
  SummaryInformation <(0x05)SummaryInformation>
  DocumentSummaryInformation <(0x05)DocumentSummaryInformation>
  Escher -
    EscherStm
    EscherDelayStm
  Quill -
    QuillSub -
      CONTENTS
      CompObj <(0x01)CompObj>
  Envelope
  Contents
  Internal <(0x03)Internal>
  CompObj <(0x01)CompObj>
  VBA -
    (no children)

Changing Text

If you change the text of a file without changing how much text there is, then the CONTENTS stream will undergo a small change, and the Contents stream will undergo a large change.

If you change the text of a file and also change the amount of text there is, then both the Contents and the CONTENTS streams change.

Changing Shapes

If you alter the size of a textbox, but make no text changes, then both the Contents and CONTENTS streams change. There are no changes to the Escher streams.

If you set the background colour of a textbox, but make no changes to the text, (to finish off)

Structure of CONTENTS

First we have "CHNKINK ", followed by 24 bytes.

Next we have 20 sequences of 24 bytes each. If the first two bytes of a sequence are 18 00 (0x1800), then that sequence entry exists, but if they are 0x0000 then the entry doesn't exist. If it does exist, we then have 4 bytes of upper case ASCII text, followed by three little endian shorts. The first of these seems to be the count of that type (starting from zero), the second is usually 1, and the third is usually zero. Then we have another 4 bytes of upper case ASCII text, normally but not always the same as the first text. Finally, we have an unsigned little endian 32 bit offset to the start of the data for this entry, then an unsigned little endian 32 bit length of that section.
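As a rough illustration of the layout just described, a single 24 byte sequence could be decoded along these lines in Python. This is only a sketch and not part of POI: the names `Entry` and `parse_entry` are made up for the example, and stripping spaces or NULs from short tags is an assumption about how they are padded.

```python
import struct
from typing import NamedTuple, Optional

ENTRY_SIZE = 24  # 2 marker bytes + 4 text + 3 shorts + 4 text + 2 x 32 bit


class Entry(NamedTuple):
    kind: str      # first 4 bytes of ASCII text, e.g. "TEXT"
    shorts: tuple  # the three little endian shorts
    fmt: str       # second 4 bytes of ASCII text
    offset: int    # start of the data within the stream
    length: int    # length of the data


def parse_entry(raw: bytes) -> Optional[Entry]:
    """Decode one sequence entry, or return None if it doesn't exist."""
    if len(raw) != ENTRY_SIZE:
        raise ValueError("a sequence entry is exactly 24 bytes")
    # Only 18 00 (exists) and 00 00 (doesn't exist) are described;
    # this sketch treats anything other than 18 00 as absent.
    if raw[:2] != b"\x18\x00":
        return None
    kind, s1, s2, s3, fmt, offset, length = struct.unpack_from("<4s3H4sII", raw, 2)
    # Assumption: tags shorter than 4 characters are padded with spaces or NULs.
    return Entry(
        kind.decode("ascii").rstrip(" \x00"),
        (s1, s2, s3),
        fmt.decode("ascii").rstrip(" \x00"),
        offset,
        length,
    )


# A TEXT entry pointing at offset 0x200 with length 0x3d0.
sample = (b"\x18\x00" + b"TEXT" + bytes.fromhex("000001000000")
          + b"TEXT" + bytes.fromhex("00020000d0030000"))
print(parse_entry(sample))
```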

Normally, the first sequence entry is for TEXT, and the text data will start at 0x200. After that there are normally two or three STSH entries (so the first short has values 0, then 1, then 2). After that it seems to vary.

At 0x200 we have the text, stored as little endian 16 bit unicode.

After the text comes all sorts of other data, presumably as described by the sequences.

For a CONTENTS stream of length 7168 / 0x1c00 bytes, the start looks something like:

CHNKINK // "CHNKINK "
04 00 07 00 // Normally 04 00 07 00
13 00 00 03 // Normally ## 00 00 03
00 02 00 00 // Normally 00 ## 00 00
00 1c 00 00 // Normally length of the stream
f8 01 13 00 // Normally f8 01 11/13 00
ff ff ff ff // Normally seems to be ffffffff
18 00
TEXT 00 00 01 00 00 00 // TEXT 0 1 0
TEXT 00 02 00 00 d0 03 00 00 // TEXT from: 200 (512), len: 3d0 (976)
18 00
STSH 00 00 01 00 00 00 // STSH 0 1 0
STSH d0 05 00 00 1e 00 00 00 // STSH from: 5d0 (1488), len: 1e (30)
18 00
STSH 01 00 01 00 00 00 // STSH 1 1 0
STSH ee 05 00 00 b8 01 00 00 // STSH from: 5ee (1518), len: 1b8 (440)
18 00
STSH 02 00 01 00 00 00 // STSH 2 1 0
STSH a6 07 00 00 3c 00 00 00 // STSH from: 7a6 (1958), len: 3c (60)
18 00
FDPP 00 00 01 00 00 00 // FDPP 0 1 0
FDPP 00 08 00 00 00 02 00 00 // FDPP from: 800 (2048), len: 200 (512)
18 00
FDPC 00 00 01 00 00 00 // FDPC 0 1 0
FDPC 00 0a 00 00 00 02 00 00 // FDPC from: a00 (2560), len: 200 (512)
18 00
FDPC 01 00 01 00 00 00 // FDPC 1 1 0
FDPC 00 0c 00 00 00 02 00 00 // FDPC from: c00 (3072), len: 200 (512)
18 00
SYID 00 00 01 00 00 00 // SYID 0 1 0
SYID 00 0e 00 00 20 00 00 00 // SYID from: e00 (3584), len: 20 (32)
18 00
SGP 00 00 01 00 00 00 // SGP 0 1 0
SGP 20 0e 00 00 0a 00 00 00 // SGP from: e20 (3616), len: a (10)
18 00
INK 00 00 01 00 00 00 // INK 0 1 0
INK 2a 0e 00 00 04 00 00 00 // INK from: e2a (3626), len: 4 (4)
18 00
BTEP 00 00 01 00 00 00 // BTEP 0 1 0
PLC 2e 0e 00 00 18 00 00 00 // PLC from: e2e (3630), len: 18 (24)
18 00
BTEC 00 00 01 00 00 00 // BTEC 0 1 0
PLC 46 0e 00 00 20 00 00 00 // PLC from: e46 (3654), len: 20 (32)
18 00
FONT 00 00 01 00 00 00 // FONT 0 1 0
FONT 66 0e 00 00 48 03 00 00 // FONT from: e66 (3686), len: 348 (840)
18 00
TCD 03 00 01 00 00 00 // TCD 3 1 0
PLC ae 11 00 00 24 00 00 00 // PLC from: 11ae (4526), len: 24 (36)
18 00
TOKN 04 00 01 00 00 00 // TOKN 4 1 0
PLC d2 11 00 00 0a 01 00 00 // PLC from: 11d2 (4562), len: 10a (266)
18 00
TOKN 05 00 01 00 00 00 // TOKN 5 1 0
PLC dc 12 00 00 2a 01 00 00 // PLC from: 12dc (4828), len: 12a (298)
18 00
STRS 00 00 01 00 00 00 // STRS 0 1 0
PLC 06 14 00 00 46 00 00 00 // PLC from: 1406 (5126), len: 46 (70)
18 00
MCLD 00 00 01 00 00 00 // MCLD 0 1 0
MCLD 4c 14 00 00 16 06 00 00 // MCLD from: 144c (5196), len: 616 (1558)
18 00
PL 00 00 01 00 00 00 // PL 0 1 0
PL 62 1a 00 00 48 00 00 00 // PL from: 1a62 (6754), len: 48 (72)
00 00 // Blank entry follows
00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
(the text will then start)

We think that the first 4 bytes of text describe the function of the data at the offset. The first short is then the count of that type, e.g. the 2nd entry of a type will have 1. We think that the second 4 bytes of text describe the format of the data block at the offset. The format of the text block is easy, but we're still trying to figure out the others.
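Putting the pieces together, the whole table could be walked roughly as below. Note that the 8 byte signature, the 24 header bytes and the 20 sequences of 24 bytes add up to 8 + 24 + 480 = 512 = 0x200, which matches where the text normally starts. This is a Python sketch under the assumptions above, not POI's implementation; `list_bits` and the constant names are invented for the example.

```python
import struct

SIGNATURE = b"CHNKINK "
TABLE_START = len(SIGNATURE) + 24  # 8 byte signature + 24 header bytes
ENTRY_SIZE = 24
ENTRY_COUNT = 20


def list_bits(contents: bytes):
    """Return (function, count, format, data) for every entry that exists."""
    if not contents.startswith(SIGNATURE):
        raise ValueError("stream does not start with 'CHNKINK '")
    bits = []
    for i in range(ENTRY_COUNT):
        pos = TABLE_START + i * ENTRY_SIZE
        if contents[pos:pos + 2] != b"\x18\x00":
            continue  # 00 00 means this entry doesn't exist
        function, count, _second, _third, fmt, offset, length = struct.unpack_from(
            "<4s3H4sII", contents, pos + 2
        )
        data = contents[offset:offset + length]
        bits.append((function.decode("ascii"), count, fmt.decode("ascii"), data))
    return bits
```

Each tuple pairs the two 4 byte tags with the raw bytes of the block they point at, so a BTEP entry would come back with the format tag "PLC " and the bytes of that PLC block.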

Structure of TEXT bit

This is very simple. All the text for the document is stored in a single bit of the Quill CONTENTS. The text is stored as little endian 16 bit unicode strings.
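Given the offset and length from the TEXT entry, pulling the text out could be as small as the following Python sketch. The function name `decode_text` is hypothetical, and the sketch assumes "16 bit unicode" can be read with Python's `utf-16-le` codec.

```python
def decode_text(contents: bytes, offset: int, length: int) -> str:
    """Decode the TEXT bit of a Quill CONTENTS stream."""
    raw = contents[offset:offset + length]
    if len(raw) != length:
        raise ValueError("TEXT bit runs past the end of the stream")
    # Little endian 16 bit unicode, two bytes per character.
    return raw.decode("utf-16-le")
```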

Structure of PLC bit

The first four bytes seem to hold the count of the entries in the bit, and the second four bytes seem to hold the type. There is then some pre-data, and then data for each of the entries, the exact format dependent on the type.

Type 0 has four 2-byte unsigned ints, then a pair of 2-byte unsigned ints for each entry.

Type 4 has four 2-byte unsigned ints, then a pair of 4-byte unsigned ints for each entry.

Type 8 has seven 2-byte unsigned ints, then a pair of 4-byte unsigned ints for each entry.

Type 12 holds hyperlinks, and is very much more complex. See org.apache.poi.hpbf.model.qcbits.QCPLCBit for our best guess as to how the contents match up.
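For the three simpler types (0, 4 and 8), the description above could be turned into code roughly as follows. This is a Python sketch, not a port of QCPLCBit: `parse_plc` and `PLC_LAYOUT` are names invented here, and reading the count, type and all other values as little endian is an assumption carried over from the rest of the format. Type 12 is deliberately left out.

```python
import struct

# PLC type -> (number of 2-byte pre-data ints, struct code for each pair value)
PLC_LAYOUT = {
    0: (4, "H"),  # pairs of 2-byte unsigned ints
    4: (4, "I"),  # pairs of 4-byte unsigned ints
    8: (7, "I"),  # pairs of 4-byte unsigned ints
}


def parse_plc(bit: bytes):
    """Return (type, pre_data, entries) for a type 0, 4 or 8 PLC bit."""
    count, plc_type = struct.unpack_from("<II", bit, 0)
    if plc_type not in PLC_LAYOUT:
        raise ValueError(f"PLC type {plc_type} is not handled by this sketch")
    n_pre, code = PLC_LAYOUT[plc_type]
    pre_data = struct.unpack_from(f"<{n_pre}H", bit, 8)
    pair = struct.Struct("<" + code * 2)
    start = 8 + 2 * n_pre  # 8 bytes of count and type, then the pre-data
    entries = [pair.unpack_from(bit, start + i * pair.size) for i in range(count)]
    return plc_type, pre_data, entries
```

Under these assumptions a type 0 bit with two entries takes 8 + 8 + 2 * 4 = 24 bytes.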

by Nick Burch
