OfficeFMT currently provides 2 different XSLT filtering components:
org.openoffice.comp.officefmt.xslt.GenericXSLTFilter
org.openoffice.comp.officefmt.xml.WriterFlatXMLOptimizer.
The first component is very similar to the generic XSLT filter bundled with OpenOffice.org 1.1. However, I had to add this filter to my package in order to implement the following 2 features, which are not available in the standard XSLT filter:
the ability to access XSLT files stored in the OpenOffice.org pkgchk cache;
access to the FilterData argument passed to this filter. This argument may contain filter options that were set earlier (e.g. in a GUI options dialog). The options are then passed on to the corresponding XSLT transformation, which makes it possible to create a customizable XSLT filter.
This component may be used in combination with style sheets designed to transform any type of OpenOffice.org document.
The second filtering component is based on the first one, but is designed especially for OpenOffice.org Writer documents. This component performs some XML code cleanup before passing the code to a stylesheet, i.e. it parses an XML document for redundant formatting tags and removes them if necessary. It is well known that documents generated by OpenOffice.org are not always as clean as they could be. In particular, the most common problems with XML layout in Writer documents are the following:
Sometimes several adjacent <text:span> elements with the same automatic style are generated. This situation is described here:
http://www.openoffice.org/issues/show_bug.cgi?id=23552
If the WriterFlatXMLOptimizer filtering component finds such redundant formatting tags, it joins them together.
Sometimes "hard" formatting of a text range just reproduces "soft" formatting already applied to that range. For example, if you have manually formatted some text in bold, and then applied to that paragraph a heading style which also has bold weight in its properties, the initial <text:span> tag specifying the character formatting will not be removed, although it is no longer needed. The worst thing is that you can't find such ranges with hard formatting using the OpenOffice.org search/replace functionality, because searching for formatted text is currently broken and nobody plans to fix it. Additional information can be found on this page:
http://www.openoffice.org/issues/show_bug.cgi?id=10569
WriterFlatXMLOptimizer tries to recognize such redundant formatting properties and remove them, either by resetting the style to default (on the paragraph level) or by removing the corresponding formatting tags (on the character level). Of course, removing formatting tags doesn't affect the text content itself, which is always preserved. So with OfficeFMT you can save to the FlatXML format in order to clean your documents of the garbage accumulated over multiple editing cycles.
Another type of "garbage" may be "asian" and "complex" formatting, which is generated automatically, e.g. when you import a Microsoft Word document into the sxw format. For more information look here:
http://www.openoffice.org/issues/show_bug.cgi?id=14013
The filtering component may treat all "asian" and "complex" formatting properties as redundant and remove them from your document. Of course this may be inconvenient for people who really need CJK or complex languages in their work. So this feature is active only if support for the corresponding type of scripts ("asian" or "complex") is disabled in OpenOffice.org settings (Tools -> Options -> Language Settings -> Languages).
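The first and the last of the cleanups described above can be illustrated with a minimal Python sketch. This is not the code of WriterFlatXMLOptimizer, only an approximation of the idea. The function names are hypothetical, and the sketch assumes the OpenOffice.org 1.x namespace URIs and that script-specific formatting appears as attributes ending in -asian or -complex on <style:properties> elements.

```python
import xml.etree.ElementTree as ET

# Assumed OpenOffice.org 1.x namespace URIs.
TEXT_NS = "http://openoffice.org/2000/text"
STYLE_NS = "http://openoffice.org/2000/style"

SPAN = f"{{{TEXT_NS}}}span"
STYLE_NAME = f"{{{TEXT_NS}}}style-name"
PROPERTIES = f"{{{STYLE_NS}}}properties"


def merge_adjacent_spans(parent):
    """Join neighbouring <text:span> children that share a style name.

    Hypothetical helper: two spans are merged only when no text at all
    (not even a space) stands between them.
    """
    i = 0
    while i < len(parent) - 1:
        first, second = parent[i], parent[i + 1]
        mergeable = (
            first.tag == SPAN
            and second.tag == SPAN
            and first.get(STYLE_NAME) == second.get(STYLE_NAME)
            and not first.tail
        )
        if not mergeable:
            i += 1
            continue
        # Append the leading text of the second span to the end of the first.
        if len(first):
            first[-1].tail = (first[-1].tail or "") + (second.text or "")
        else:
            first.text = (first.text or "") + (second.text or "")
        # Move nested elements, keep the text that followed the second span.
        first.extend(list(second))
        first.tail = second.tail
        parent.remove(second)
    for child in parent:
        merge_adjacent_spans(child)


def strip_script_properties(root, asian=True, complex_=True):
    """Drop -asian / -complex attributes from <style:properties> elements.

    The flags stand in for the "script support disabled" settings.
    """
    suffixes = tuple(
        s for s, on in (("-asian", asian), ("-complex", complex_)) if on
    )
    if not suffixes:
        return
    for element in root.iter(PROPERTIES):
        for name in list(element.attrib):
            if name.endswith(suffixes):
                del element.attrib[name]
```

Both functions modify the parsed tree in place and never touch the text content, which matches the guarantee stated above. Removing spans whose hard formatting merely repeats the paragraph style is not covered here, because it requires resolving style inheritance.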
The WriterFlatXMLOptimizer filtering component is used by all XSLT-based filters available in OfficeFMT, namely:
OpenOffice.org FlatXML import/export filter;
The FlatXML format is nearly the same thing as the standard OpenOffice.org file format: the only difference is that the XML layout which is split across several files inside an sxw document is stored in a single XML file, without compression. Note that generating FlatXML is always the starting point for all XML-based conversions, so this job should already be done before the XML code is passed to a stylesheet. So my version of the OpenOffice.org Writer FlatXML filter just extends my XSLT filtering component with a simple office2flat.xsl stylesheet, which reproduces the code almost as it is, but additionally performs the following 2 things:
Indentation is added to the XML output in order to make it more human readable.
A link to another XSL stylesheet is added to the generated XML documents. This stylesheet (office2html.xsl) is also available in the xslt/office2html/ subdirectory of the OfficeFMT zipped package. So if you extract this file from the archive and put it into the directory where FlatXML files generated with OfficeFMT are stored, you will be able to preview your FlatXML documents in a Web browser (Mozilla or MSIE).
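The two additions listed above are performed by the office2flat.xsl stylesheet itself. Purely as an illustration of the result, the following Python sketch does the same thing outside XSLT: it inserts an xml-stylesheet processing instruction before the root element and indents the output. The function name is hypothetical, and the relative href assumes that office2html.xsl lies in the same directory as the generated file.

```python
from xml.dom import minidom


def add_stylesheet_link(xml_text, href="office2html.xsl"):
    """Return indented XML with an xml-stylesheet link before the root."""
    doc = minidom.parseString(xml_text)
    # Browsers use this processing instruction to find the stylesheet.
    link = doc.createProcessingInstruction(
        "xml-stylesheet", f'type="text/xsl" href="{href}"'
    )
    doc.insertBefore(link, doc.documentElement)
    # Caution: pretty-printing adds whitespace inside mixed content,
    # so this step is only safe for a simple illustration like this one.
    return doc.toprettyxml(indent="  ")


print(add_stylesheet_link("<doc><p>hi</p></doc>"))
```

With such a link in place, a browser that supports client-side XSLT applies the referenced stylesheet when the XML file is opened.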
Of course the officefmt.xsl stylesheet is designed mainly for rather simple text documents, i.e. those including only text and tables. However, it can correctly reproduce almost all types of paragraph and character formatting. It also correctly handles footnotes/endnotes (they are collected at the end of the document and links to them are added to the body text) and (in most cases) correctly reproduces table layout, even for complex tables with cells spanning several columns.
Note that the XML code passed to the FlatXML filter has always already been cleaned up by the WriterFlatXMLOptimizer component. Of course it would have been possible to implement this code cleanup in pure XSLT, but, unfortunately, processing such a stylesheet might take a lot of processing time and resources, especially with the Xalan XSLT processor, which Java (and therefore OpenOffice.org too) uses by default. However, OfficeFMT additionally includes a sample stylesheet (called writer2flat.xsl), which does the same job as the WriterFlatXMLOptimizer component and the office2flat.xsl file together. So you may use this stylesheet, e.g., to beautify and clean up XML code produced by any other version of the OpenOffice.org Writer FlatXML filter.
The same office2html.xsl stylesheet which is designed for displaying FlatXML files in a browser can also be used for direct conversion of OpenOffice.org Writer files into the XHTML format. Once again, this stylesheet is designed mainly for simple files, but in some cases it can produce better results than the standard XHTML filter (which, for example, simply omits footnotes and endnotes).