Apache POI - HWPF - Java API to Handle Microsoft Word Files

Word File Format

The Word 97 File Format in semi-plain English

The purpose of this document is to give a brief, high-level overview of the Word document format as read by HWPF. It does not go into in-depth technical detail and is meant only as a supplement to the Microsoft Word 97-2007 Binary File Format specification, which is freely available from Microsoft.

The OLE file format is not discussed in this document. It is assumed that the reader has a working knowledge of the POIFS API.

Word file structure

A Word file is made up of the document text and data structures containing formatting information about the text. This is, of course, a very simplified picture: fields, macros, and other features are not considered here. At this stage, HWPF is mainly concerned with formatted text.

Reading Word files

The entry point for HWPF's reading of a Word file is the File Information Block (FIB). This structure records the locations and sizes of a document's text and data structures. The FIB is located at the beginning of the main stream.
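
As a rough illustration of what "located at the beginning of the main stream" means in practice, the sketch below reads three FIB fields straight from the first bytes of the stream. The class name FibHeaderSketch is hypothetical and not part of POI. The field offsets (wIdent at 0, nFib at 2, fcMin at 24), the little-endian byte order, and the 0xA5EC identifier are assumptions taken from the published binary format rather than from this document, so check them against the specification before relying on them.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper, not part of the POI API. */
final class FibHeaderSketch {
    /** Assumed identifier stored in the first two bytes of a Word binary file. */
    static final int WORD_MAGIC = 0xA5EC;

    /** Reads the unsigned 16-bit identifier at offset 0 of the main stream. */
    static int wIdent(byte[] mainStream) {
        return le(mainStream).getShort(0) & 0xFFFF;
    }

    /** Reads the unsigned 16-bit format version number at offset 2. */
    static int nFib(byte[] mainStream) {
        return le(mainStream).getShort(2) & 0xFFFF;
    }

    /** Reads FIB.fcMin, assumed here to sit at byte offset 24. */
    static int fcMin(byte[] mainStream) {
        return le(mainStream).getInt(24);
    }

    /** Wraps the bytes in a little-endian view; nothing is copied. */
    private static ByteBuffer le(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    }
}
```

In real code the main stream bytes would come from POIFS, and HWPF already parses the complete FIB for you; the sketch only shows that the FIB is ordinary fixed-offset binary data.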

Text

The document's text is also located in the main stream. Its starting location is given by FIB.fcMin and its length by FIB.ccpText. These two values are not very useful for getting the text because of Unicode: ccpText counts characters rather than bytes, and Unicode text may be intermingled with ASCII text. That brings us to the piece table.

The piece table is used to divide the text into non-Unicode and Unicode pieces. Its offset and size are given in FIB.fcClx and FIB.lcbClx respectively. The piece table may contain Property Modifiers (prm). These are for complex (fast-saved) files and are skipped. Each text piece records the offset in the main stream at which its text is stored. If the piece is not Unicode (its text is stored as 8-bit characters), the stored file offset has a flag bit set (bit 30, 0x40000000). In that case you have to clear the bit and divide by 2 to get the real file offset. For a Unicode piece the stored offset is used as is.
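
The offset decoding can be sketched in a few lines of Java. This is a minimal illustration and not POI's own code: the class and method names are hypothetical, and it assumes the convention of the published binary format, in which bit 30 (0x40000000) of the stored offset is set for pieces stored as 8-bit characters and clear for Unicode pieces.

```java
/** Hypothetical helper, not part of the POI API. */
final class PieceOffsetSketch {
    /** Assumed flag bit (bit 30) marking a piece stored as 8-bit characters. */
    static final int COMPRESSED_FLAG = 0x40000000;

    /** A piece is treated as Unicode when the flag bit is clear. */
    static boolean isUnicode(int storedFc) {
        return (storedFc & COMPRESSED_FLAG) == 0;
    }

    /** Returns the byte offset in the main stream where the piece's text starts. */
    static int realOffset(int storedFc) {
        if (isUnicode(storedFc)) {
            return storedFc; // Unicode pieces store the offset directly
        }
        // 8-bit pieces: clear the flag bit, then halve the remaining value
        return (storedFc & ~COMPRESSED_FLAG) / 2;
    }

    public static void main(String[] args) {
        int stored = 0x40001000;
        // prints "unicode=false offset=0x800"
        System.out.printf("unicode=%b offset=0x%X%n", isUnicode(stored), realOffset(stored));
    }
}
```

The same flag also tells the reader how to decode the bytes once the offset is known: one byte per character for flagged pieces, two bytes per character otherwise.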

Text Formatting

Stylesheet

All text formatting is based on styles contained in the StyleSheet. The StyleSheet is a data structure containing, among other things, style descriptions. Each style description can contain a paragraph style and a character style, or simply a character style. Each style description is stored on file in a compressed form: essentially, as deltas from another style.

Eventually, you have to chain back to the nil style, which is an imaginary style with certain implied values.
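
The chaining idea can be sketched as follows. Everything in this example is a simplified, hypothetical model: real style descriptions hold binary deltas rather than a map of named values, and the NIL marker value and the implied defaults passed in are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of delta-based styles, not the POI StyleSheet class. */
final class StyleChainSketch {
    /** Assumed index that marks "based on the nil style". */
    static final int NIL = 0x0FFF;

    /** One style description: the style it is based on plus its own changes. */
    static final class StyleDelta {
        final int baseIndex;
        final Map<String, Object> changes;

        StyleDelta(int baseIndex, Map<String, Object> changes) {
            this.baseIndex = baseIndex;
            this.changes = changes;
        }
    }

    /** Builds the complete style for sheet[index] by chaining back to nil. */
    static Map<String, Object> resolve(List<StyleDelta> sheet, int index,
                                       Map<String, Object> nilDefaults) {
        // Walk from the requested style towards nil; push() puts each base
        // style in front, so the root-most style ends up at the head.
        Deque<StyleDelta> chain = new ArrayDeque<>();
        for (int i = index; i != NIL; i = sheet.get(i).baseIndex) {
            chain.push(sheet.get(i));
        }
        // Start from the implied nil values, then apply deltas root first.
        Map<String, Object> result = new HashMap<>(nilDefaults);
        for (StyleDelta delta : chain) {
            result.putAll(delta.changes);
        }
        return result;
    }
}
```

Applying the deltas from the root outwards means a style's own changes always override anything inherited from the styles it is based on. The sketch does not guard against a cyclic chain, which a robust reader would need to do.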

Paragraph and Character styles

Paragraph and Character formatting properties for a document's text are stored on file as deltas from some base style in the StyleSheet. The deltas are used to create a complete uncompressed style in memory.

Uncompressed paragraph styles are represented by the Paragraph Properties (PAP) data structure. Uncompressed character styles are represented by the Character Properties (CHP) data structure. The styles for the document text are stored in compressed format in the corresponding Formatted Disk Pages (FKP). A compressed PAP is referred to as a PAPX and a compressed CHP as a CHPX. The FKP locations are stored in the bin table. There are separate bin tables for CHPXs and PAPXs. The bin tables' locations and sizes are stored in the FIB.

An FKP is a 512-byte OLE page. It contains the offsets of the beginning and end of each paragraph or character run in the main stream and the compressed properties for that interval. The compressed PAPX is based on its base style in the StyleSheet. The compressed CHPX is based on the enclosing paragraph's base style in the StyleSheet.
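
To make the page layout more concrete, here is a sketch that pulls the run boundaries out of a single FKP. It is not POI's implementation, and the class name is hypothetical. It assumes, following the published binary format rather than anything stated above, that the last byte of the page holds the number of runs and that the page starts with one more 4-byte little-endian file offset than there are runs, so that run i spans offsets i to i+1.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper, not part of the POI API. */
final class FkpRunsSketch {
    static final int PAGE_SIZE = 512;

    /** Returns runCount + 1 stream offsets; run i covers [result[i], result[i + 1]). */
    static int[] runBoundaries(byte[] page) {
        if (page.length != PAGE_SIZE) {
            throw new IllegalArgumentException("an FKP must be exactly 512 bytes");
        }
        // Assumed layout: the run count is stored in the very last byte.
        int runCount = page[PAGE_SIZE - 1] & 0xFF;
        if ((runCount + 1) * 4 > PAGE_SIZE - 1) {
            throw new IllegalArgumentException("run count too large for one page");
        }
        ByteBuffer buf = ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN);
        int[] boundaries = new int[runCount + 1];
        for (int i = 0; i <= runCount; i++) {
            boundaries[i] = buf.getInt(i * 4); // absolute read, 4 bytes per offset
        }
        return boundaries;
    }
}
```

The compressed properties for each run sit elsewhere in the same page and are located through a further per-run table, which this sketch deliberately leaves out.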

Uncompressing styles and other data structures

All compressed properties (CHPX, PAPX, SEPX) contain a grpprl. A grpprl is an array of sprms. A sprm defines a delta from some base property. There is a table of possible sprms in the Word 97 spec. Each sprm is a two-byte opcode followed by a parameter. The parameter size depends on the sprm. Each sprm describes an operation that should be performed on the base style. After every sprm in the grpprl has been applied to the base style, you have the complete style for the paragraph, character run, section, etc.
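
The uncompression loop might look roughly like the sketch below, which applies a grpprl to a heavily reduced character style. The class GrpprlSketch and its CharProps type are hypothetical stand-ins, not POI classes. The sketch assumes that the two-byte code is little-endian, that its top three bits select the parameter size, and that 0x0835, 0x0836 and 0x4A43 are the codes for bold, italic and font size in half-points; these details come from the published binary format and should be verified there. Variable-length sprms and the special "same as style" toggle values are left out.

```java
/** Hypothetical sketch of applying a grpprl, not the POI implementation. */
final class GrpprlSketch {
    static final int SPRM_BOLD = 0x0835;      // assumed code: bold on/off
    static final int SPRM_ITALIC = 0x0836;    // assumed code: italic on/off
    static final int SPRM_FONT_SIZE = 0x4A43; // assumed code: size in half-points

    /** Heavily reduced stand-in for an uncompressed CHP. */
    static final class CharProps {
        boolean bold;
        boolean italic;
        int halfPoints = 20;
    }

    /** Parameter size in bytes, chosen by the top three bits of the code. */
    static int parameterSize(int code) {
        switch ((code >>> 13) & 0x7) {
            case 0: case 1: return 1;
            case 2: case 4: case 5: return 2;
            case 3: return 4;
            case 7: return 3;
            default: return -1; // variable length, not handled in this sketch
        }
    }

    /** Returns a copy of base with every sprm in grpprl applied in order. */
    static CharProps apply(CharProps base, byte[] grpprl) {
        CharProps result = new CharProps();
        result.bold = base.bold;
        result.italic = base.italic;
        result.halfPoints = base.halfPoints;
        int pos = 0;
        while (pos + 2 <= grpprl.length) {
            int code = (grpprl[pos] & 0xFF) | ((grpprl[pos + 1] & 0xFF) << 8);
            pos += 2;
            int size = parameterSize(code);
            if (size < 0) {
                throw new UnsupportedOperationException("variable-length sprm");
            }
            if (pos + size > grpprl.length) {
                throw new IllegalArgumentException("truncated sprm parameter");
            }
            int value = 0;
            for (int i = 0; i < size; i++) {
                value |= (grpprl[pos + i] & 0xFF) << (8 * i); // little-endian
            }
            pos += size;
            switch (code) {
                case SPRM_BOLD:
                    if (value <= 1) result.bold = (value == 1); // other values ignored here
                    break;
                case SPRM_ITALIC:
                    if (value <= 1) result.italic = (value == 1);
                    break;
                case SPRM_FONT_SIZE:
                    result.halfPoints = value;
                    break;
                default:
                    break; // unknown sprms are skipped using their parameter size
            }
        }
        return result;
    }
}
```

Because the parameter size is derivable from the code itself, a reader can step over sprms it does not understand and still apply the ones it does, which is what the default branch relies on.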

by S. Ryan Ackley

 