Apache POI - HWPF and XWPF - Java API to Handle Microsoft Word Files

Overview

HWPF is the name of our port of the Microsoft Word 97(-2007) file format to pure Java. It also provides limited read-only support for the older Word 6 and Word 95 file formats.

The partner to HWPF for the new Word 2007 .docx format is XWPF. Whilst HWPF and XWPF provide similar features, there is not a common interface across the two of them at this time.

Both HWPF and XWPF could be described as "moderately functional". For some use cases, especially around text extraction, support is very strong. For others, support may be limited or incomplete, and it may be necessary to dig down into low-level code. Error checking may be missing in places, so it may be possible to accidentally generate invalid files. Enhancements to fix such things are generally very well received!

As detailed in the Components Page, HWPF is contained within the poi-scratchpad-XXX.jar, while XWPF is in the poi-ooxml-XXX.jar. You will need to ensure you include the appropriate jars (and their dependencies!) in your classpath to use HWPF or XWPF.

Please note that in version 3.12, due to a bug, you might need to include poi-scratchpad-XXX.jar when using XWPF. This has been fixed for the following release, as no such dependency should exist.

An overview of the code

Source in the org.apache.poi.hwpf.model tree is the Java representation of the internal Word format structure. This code is internal and should not be used by your code. The org.apache.poi.hwpf.usermodel package is the actual public API, kept as user-friendly as possible, for accessing document parts. Source code in the org.apache.poi.hwpf.extractor tree is a wrapper around this that makes it easy to extract interesting things (e.g. the text), and the org.apache.poi.hwpf.converter package contains Word-to-HTML and Word-to-FO converters (the latter can be used to generate PDF from Word files when combined with Apache FOP). There is also a small file-structure-dumping utility in the org.apache.poi.hwpf.dev package, primarily for development purposes.
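
As a minimal sketch of the extractor layer described above, the following pulls the plain text out of a .doc stream with WordExtractor from org.apache.poi.hwpf.extractor. It assumes poi-scratchpad and its dependencies are on the classpath; the class name DocTextDump and the method name extractText are made up for this illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.extractor.WordExtractor;

// Hypothetical example class; only WordExtractor comes from POI.
class DocTextDump {

    /** Returns the plain text of a Word 97(-2007) .doc stream. */
    static String extractText(InputStream in) throws IOException {
        // The extractor wraps the usermodel API, so no Range handling is needed here.
        try (WordExtractor extractor = new WordExtractor(in)) {
            return extractor.getText();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: DocTextDump <file.doc>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(extractText(in));
        }
    }
}
```

Input that is not a Word binary file is rejected with an exception when the extractor is constructed, so callers reading untrusted files may want to catch that.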

The main entry point to HWPF is HWPFDocument. Currently it has a lot of references both to internal interfaces (the org.apache.poi.hwpf.model package) and to the public API (the org.apache.poi.hwpf.usermodel package). It is possible that it will be split into two different interfaces (like WordFile and WordDocument) in later versions.

The main entry point to XWPF is XWPFDocument. From there, you can get the paragraphs, pictures, tables, sections, headers, etc.
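
To illustrate XWPFDocument as the entry point, here is a small sketch that builds a two-paragraph .docx in memory and then reads the paragraph text back. It assumes poi-ooxml and its dependencies are on the classpath; the class name XwpfRoundTrip and its method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

// Hypothetical example class; the XWPF types come from POI.
class XwpfRoundTrip {

    /** Builds a small .docx in memory and returns its bytes. */
    static byte[] buildSample() throws IOException {
        try (XWPFDocument doc = new XWPFDocument();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            XWPFParagraph heading = doc.createParagraph();
            XWPFRun run = heading.createRun();
            run.setBold(true);
            run.setText("Hello from XWPF");

            doc.createParagraph().createRun().setText("Second paragraph");

            doc.write(out);
            return out.toByteArray();
        }
    }

    /** Reads a .docx and returns the text of each body paragraph. */
    static List<String> paragraphTexts(byte[] docx) throws IOException {
        List<String> texts = new ArrayList<>();
        try (XWPFDocument doc = new XWPFDocument(new ByteArrayInputStream(docx))) {
            for (XWPFParagraph paragraph : doc.getParagraphs()) {
                texts.add(paragraph.getText());
            }
        }
        return texts;
    }

    public static void main(String[] args) throws IOException {
        for (String text : paragraphTexts(buildSample())) {
            System.out.println(text);
        }
    }
}
```

The same document object also exposes tables, pictures and headers; this sketch stops at paragraphs to stay short.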

Currently, there are only a handful of example programs using HWPF and XWPF available. They can be found in svn in the examples section, under HWPF and XWPF. Both HWPF and XWPF have fairly high levels of unit test coverage, and the tests provide examples of using the various areas of functionality of both modules. These can be found in svn, under HWPF and XWPF. Contributions of more examples, whether inspired by the unit tests or not, would be most welcome!

HWPF Notes

A .doc Word document, as handled by HWPF, can be thought of as one very long text buffer. The HWPF API provides "pointers" to document parts, such as sections, paragraphs and character runs. Typically, you iterate over the sections of the main document part, then over the paragraphs of each section, and then over the character runs of each paragraph. Each of these is a pointer to a subrange of the document text, together with additional properties, and they all extend the same Range parent class. There are additional Range implementations such as Table, TableRow and TableCell. Some structures, such as Bookmark or Field, can also provide subrange pointers.
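
The section, paragraph and character run iteration described above might look roughly like the sketch below. It assumes poi-scratchpad and its dependencies are on the classpath; the class name HwpfRangeWalk and the method name describe are hypothetical, and the output format is chosen only for this illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.hwpf.usermodel.Section;

// Hypothetical example class; the HWPF types come from POI.
class HwpfRangeWalk {

    /** Lists every character run with its offsets into the document text. */
    static String describe(InputStream in) throws IOException {
        StringBuilder report = new StringBuilder();
        try (HWPFDocument doc = new HWPFDocument(in)) {
            Range overall = doc.getRange(); // pointer to the whole main document part
            for (int s = 0; s < overall.numSections(); s++) {
                Section section = overall.getSection(s);
                for (int p = 0; p < section.numParagraphs(); p++) {
                    Paragraph paragraph = section.getParagraph(p);
                    for (int c = 0; c < paragraph.numCharacterRuns(); c++) {
                        CharacterRun run = paragraph.getCharacterRun(c);
                        report.append(String.format(
                                "section %d, paragraph %d, run %d [%d..%d) bold=%b: %s%n",
                                s, p, c,
                                run.getStartOffset(), run.getEndOffset(),
                                run.isBold(), run.text().trim()));
                    }
                }
            }
        }
        return report.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: HwpfRangeWalk <file.doc>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(describe(in));
        }
    }
}
```

This sketch only reads the document, so holding several Range instances at once is harmless here.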

Changing file content usually requires many synchronized changes across those structures, such as updating property boundaries, position handlers, etc. Because of that, the HWPF API should be considered not thread-safe. In addition, there is a "one pointer" rule for changing content: you should not use two different Range instances at the same time. More precisely, if you change file content through one range pointer, all other range pointers except its parents become invalid. For example, if you obtain the overall range (1), a paragraph range (2) from the overall range, and a character run range (3) from the paragraph range, and then change the text of the paragraph, the character run range is now invalid and should not be used, but the overall range pointer is still valid. Each time you obtain a range (pointer), a new instance is created. This means that if you obtained two range pointers and changed the document text through the first, the second became invalid.
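
The "one pointer" rule can be illustrated without HWPF at all. The following is a toy model, not HWPF's implementation: ToyRange is a made-up class that stores offsets into a shared StringBuilder and, on an edit, fixes up only itself and its parents. It shows why a sibling pointer obtained before the edit goes stale, and why re-obtaining the pointer from a still-valid parent works.

```java
// Toy model only: ToyRange and OnePointerRuleDemo are hypothetical names,
// and this is an assumption-based illustration, not HWPF's actual code.
class OnePointerRuleDemo {

    /** Stand-in for a range pointer: offsets into one shared text buffer. */
    static final class ToyRange {
        private final StringBuilder buffer;
        private final ToyRange parent; // null for the overall range
        private final int start;
        private int end;

        ToyRange(StringBuilder buffer, ToyRange parent, int start, int end) {
            this.buffer = buffer;
            this.parent = parent;
            this.start = start;
            this.end = end;
        }

        /** Returns a NEW pointer to the index-th paragraph ('\r' terminated) in this range. */
        ToyRange paragraph(int index) {
            int from = start;
            for (int i = 0; ; i++) {
                int mark = buffer.indexOf("\r", from);
                if (mark < 0 || mark >= end) {
                    throw new IndexOutOfBoundsException("no paragraph " + index);
                }
                if (i == index) {
                    return new ToyRange(buffer, this, from, mark + 1);
                }
                from = mark + 1;
            }
        }

        String text() {
            return buffer.substring(start, end);
        }

        /** Inserts text at the start of this range; only this range and its parents are adjusted. */
        void insertBefore(String s) {
            buffer.insert(start, s);
            for (ToyRange r = this; r != null; r = r.parent) {
                r.end += s.length();
            }
        }
    }

    private static String show(String s) {
        return s.replace("\r", "\\r");
    }

    public static void main(String[] args) {
        StringBuilder buffer = new StringBuilder("First paragraph.\rSecond paragraph.\r");
        ToyRange overall = new ToyRange(buffer, null, 0, buffer.length());
        ToyRange first = overall.paragraph(0);
        ToyRange second = overall.paragraph(1);

        first.insertBefore(">> "); // edit through ONE pointer

        System.out.println("first (edited):     " + show(first.text()));
        System.out.println("overall (parent):   " + show(overall.text()));
        System.out.println("second (stale):     " + show(second.text()));
        System.out.println("second (refetched): " + show(overall.paragraph(1).text()));
    }
}
```

In real HWPF code the practical consequence is the same: after changing content through one Range, obtain fresh pointers from a parent range rather than reusing ones created before the change.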

XWPF Patches Required!

At the moment, XWPF covers many common use cases for reading and writing .docx files. Whilst this is a great thing, it does mean that XWPF does everything that the current POI committers need it to do, and so none of the committers are actively adding new features.

If you need a feature in XWPF that isn't currently there, please do send in a patch to add the extra functionality! More details on contributing patches are available on the "Contribution to POI" page.

HWPF Patches Required!

At the moment we unfortunately do not have anyone taking care of HWPF and fostering its development. What we need is someone to stand up, take this thing under their wing as their baby and push it forward. Ryan Ackley, who put a lot of effort into HWPF, is no longer on board, so HWPF is an orphan child waiting to be adopted.

If you are interested in becoming the new HWPF point man, you should look into the Microsoft Word internals. A good starting point seems to be Ryan Ackley's overview. An introduction to the binary file formats is available from Microsoft, which has some good references and links. After that, the full details on the Word format are available from Microsoft, but the documentation can be a little hard to get into at first. Try reading the overview first, then look at the existing code, and finally look up the documentation for specific missing features.

As a first step you should familiarize yourself with the source code, examples, test cases, and the HWPF patches available at Bugzilla (if any). Then you should compile an overview of:

  • the current HWPF status,
  • the patches in Bugzilla to be checked in (and those that should better be ditched),
  • the available test cases and the test cases still to be written,
  • the available documentation and the docs to be written,
  • anything else that seems reasonable

When you start coding, you will not yet have write access to the SVN repository. Please submit your patches to Bugzilla and nag the dev list until someone commits them. Besides the actual checking in of HWPF patches, current POI committers will also do some minor reviews now and then of your source code patches, test cases and documentation to help ensure software quality. But most of the time you will be on your own. However, anyone offering useful contributions over a period of time will be offered committership!

Please do not forget to write JUnit test cases and documentation! We won't accept code that doesn't come with test cases. And please consider that other contributors should be able to understand your source code easily. If you need any help getting started with JUnit test cases for HWPF, please ask on the developers' mailing list! If you show that you are prepared to stick at it you will most likely be given SVN commit access. See the "Contribution to POI" page for more details and help getting started.

Of course we will help you as best as we can. However, presently there is no committer who is really familiar with the Word format, so you'll be mostly on your own. We are looking forward to you and your contributions! The honor and glory of becoming a POI committer await!

by Nicola Ken Barozzi, Andrew C. Oliver, Ryan Ackley, Rainer Klute

 