Re: Word files & Build vs. Buy?

Re: Word files & Build vs. Buy?

Christiaan Fluit
Hello all,

I'm replying to two threads at once as what I have to say relates to both.

My company recently started an open source project called Aperture
(http://sourceforge.net/projects/aperture), together with the German
DFKI institute. The project is still very much in alpha stage, but I do
believe we already have some code parts that could help people here.

Basically, it's a framework for crawling information sources (file
systems, mail folders, websites, ...) and extracting as much information
from them as possible. Besides full-text extraction, we also put a lot of
effort into extracting and modeling the metadata occurring in these
sources and document formats. Both parties have some proprietary code
lying on the shelf that is being open sourced and ported to the Aperture
architecture.

Now on to the raised questions:

[hidden email] wrote:
> WordDocument wd = new WordDocument(is);

[hidden email] wrote:
> MS Word - I know that POI exists, but development on the Word portion
> seems to have stopped, and there are a lot of nasty looking bugs in
> their DB.  Since we're involved in dealing with contracts, many of our
> Word files are large and complicated.  How has everyone's experience
> with POI's Word parsing been?

My experience is that the WordDocument class crashes on about 25% of the
documents, i.e. it throws some sort of Exception. I've tested POI
2.5.1-final as well as the current code in CVS, but both produce this
result. I even suspect the output to be 100% the same, but I haven't
verified this.

Another reason I don't like this class is that it operates on an
InputStream and internally creates a POIFSFileSystem which you cannot
access. That makes it hard to extract document metadata as well
(for which you need the POIFS) without buffering the entire InputStream.
The same applies to TextMining's WordExtractor, which also operates on
top of lower-level POI components.
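
To illustrate what "buffering the entire InputStream" amounts to, here is
a minimal sketch using only the JDK. It reads the stream into memory once
so that two consumers (say, a text extractor and a metadata extractor) can
each get a fresh stream. The class and method names are made up for this
example; they are not part of POI or Aperture.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper, not part of POI or Aperture. */
final class ReplayableStream {
    private final byte[] data;

    /** Reads the whole stream into memory: one full copy of the document. */
    ReplayableStream(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        data = out.toByteArray();
    }

    /** Each call returns an independent stream positioned at the start. */
    InputStream open() {
        return new ByteArrayInputStream(data);
    }

    /** Number of buffered bytes. */
    int size() {
        return data.length;
    }
}
```

The cost is obvious: the whole document sits in memory, which is exactly
what you would like to avoid for large files.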

I've recently committed a WordExtractor to Aperture that uses its own
code operating on these lower-level POI data structures, which works a
lot better, failing on only 5% of my 300 test docs. I don't pretend to
understand all the internals of the POI APIs, but it Works For Me.

When POI throws an exception, the WordExtractor will revert to applying
a heuristic string extraction algorithm to extract as much
human-readable text as possible from the binary stream, which works
quite well on MS Office files, i.e. the output is reasonably good for
indexing purposes.
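
For readers unfamiliar with the idea, a heuristic string extraction can
be as simple as the Unix "strings" approach: scan the bytes and keep runs
of printable characters. The sketch below is only an illustration of that
general technique, not the algorithm Aperture actually uses; the class
name, the ASCII-only range and the minimum run length are all assumptions
of this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; this is not the algorithm Aperture ships. */
final class PrintableStrings {

    /**
     * Returns runs of at least minLength printable ASCII characters, stored
     * either as single bytes or as UTF-16LE (character, 0x00) pairs.
     */
    static List<String> extract(byte[] data, int minLength) {
        List<String> result = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        boolean wide = false; // true while the current run is UTF-16LE
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            boolean nextIsZero = i + 1 < data.length && data[i + 1] == 0;
            if (isPrintable(b) && (run.length() == 0 || !wide || nextIsZero)) {
                if (run.length() == 0) {
                    wide = nextIsZero; // the first character decides the mode
                }
                run.append((char) b);
                i += wide ? 2 : 1;
            } else if (run.length() > 0) {
                // end of run; re-examine this byte as a possible new start
                flush(run, minLength, result);
            } else {
                i++;
            }
        }
        flush(run, minLength, result);
        return result;
    }

    private static boolean isPrintable(int b) {
        return b >= 0x20 && b < 0x7F;
    }

    private static void flush(StringBuilder run, int minLength, List<String> out) {
        if (run.length() >= minLength) {
            out.add(run.toString());
        }
        run.setLength(0);
    }
}
```

A real implementation would also have to deal with non-ASCII characters
and with binary noise that happens to look like text, which is why the
result is better suited for indexing than for display.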

Be sure to check out Aperture from CVS, as this code isn't part of the
alpha 1 release. The next official release is expected in a month.

[hidden email] wrote:
> RTF - javax.swing looks fine, we use those classes already.

Swing's RTFEditorKit does indeed work surprisingly well. "Surprisingly"
because in the past I had many issues with it: it typically threw
exceptions on 25-50% of my test documents. Recently I haven't seen a
single one (using Java 1.5.0), so either I am now feeding it a more
favorable document set or the Swing people have worked on the
implementation. If it is the latter, people using Java 1.4.x may see
different results.
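
For those who haven't used it for text extraction: a minimal sketch of
reading plain text out of an RTF stream with RTFEditorKit. Only JDK
classes are used; the wrapper class name and the tiny sample document in
main are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

/** Hypothetical wrapper around the JDK's RTFEditorKit. */
final class RtfText {

    /** Parses the RTF stream into a Swing Document and returns its text. */
    static String extract(InputStream in) throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        kit.read(in, doc, 0); // insert the parsed content at position 0
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        byte[] rtf = "{\\rtf1\\ansi Hello RTF}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(extract(new ByteArrayInputStream(rtf)).trim());
    }
}
```

No GUI is shown at any point; the editor kit is only used as a parser.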

> Word Perfect - There doesn't seem to be any converters for this format?

I'm actively working on this :) We have some proprietary code that will
become part of Aperture. I cannot yet say how well it performs in
practice, although we've never had complaints about it in our
proprietary apps.

The code uses a heuristic string extraction algorithm tuned for
WordPerfect documents. Being heuristic, its output may be an issue,
e.g. when you also want to display the extraction results to end users.

If you're interested: one way you can help me get the most out of it is
by sending me some example WordPerfect documents, because I have hardly
any on my hard drive. Fake documents made with very new or old
WordPerfect versions are also most welcome.


Regards,

Chris
http://aduna.biz
--

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Word files & Build vs. Buy?

Nick Burch
On Thu, 9 Feb 2006, Christiaan Fluit wrote:
> My experience is that the WordDocument class crashes on about 25% of the
> documents, i.e. it throws some sort of Exception. I've tested POI
> 2.5.1-final as well as the current code in CVS, but both produce this
> result. I even suspect the output to be 100% the same, but I haven't
> verified this.

You could try using org.apache.poi.hwpf.HWPFDocument, getting the
range, then the paragraphs, and grabbing the text from each paragraph. If
there's interest, I could probably commit an extractor that does this to
POI.

(WordDocument is from the hdf package, which is older and less reliable
than the current hwpf stuff)

> Another reason I don't like this class is that it operates on an
> InputStream and internally creates a POIFSFileSystem which you cannot
> access, so that it becomes hard to extract document metadata as well
> (for which you need the POIFS) without buffering the entire InputStream.

If you're using HWPFDocument from CVS, then you can create that from a
POIFSFileSystem.

Nick


Re: Word files & Build vs. Buy?

Christiaan Fluit
Nick Burch wrote:
> You could try using org.apache.poi.hwpf.HWPFDocument, and getting the
> range, then the paragraphs, and grab the text from each paragraph. If
> there's interest, I could probably commit an extractor that does this to
> poi.

Yes, that's exactly what I'm doing. Having this in POI would benefit me
a lot though, as I hardly understand the POI basics to be honest (my
fault, not POI's).

This is my current code (adapted from Aperture code in CVS):

HWPFDocument doc = new HWPFDocument(poiFileSystem);
StringBuffer buffer = new StringBuffer(4096);

Iterator textPieces = doc.getTextTable().getTextPieces().iterator();
while (textPieces.hasNext()) {
        TextPiece piece = (TextPiece) textPieces.next();

        // the following is derived from
        // http://article.gmane.org/gmane.comp.jakarta.poi.devel/7406
        String encoding = "Cp1252";
        if (piece.usesUnicode()) {
                encoding = "UTF-16LE";
        }

        buffer.append(new String(piece.getRawBytes(), encoding));
}

// normalize end-of-line characters and remove any lines that
// contain DOCPROPERTY field codes (END_OF_LINE is a constant
// defined elsewhere in the class)
BufferedReader reader =
        new BufferedReader(new StringReader(buffer.toString()));
buffer.setLength(0);

String line;
while ((line = reader.readLine()) != null) {
        if (line.indexOf("DOCPROPERTY") == -1) {
                buffer.append(line);
                buffer.append(END_OF_LINE);
        }
}

// fetch the extracted full-text
String text = buffer.toString();


Regards,

Chris
--


RE: Word files & Build vs. Buy?

Dmitry Goldenberg
In reply to this post by Christiaan Fluit
Chris,
 
Awesome stuff. A few questions: Is your Excel extractor somehow better than POI's? What do you see as the timeframe for adding WordPerfect support? Are you considering supporting any other sources such as MS Project, FrameMaker, etc.?
 
Thanx,
- Dmitry

________________________________

From: Christiaan Fluit [mailto:[hidden email]]
Sent: Thu 2/9/2006 4:09 AM
To: [hidden email]
Subject: Re: Word files & Build vs. Buy?




Re: Word files & Build vs. Buy?

Christiaan Fluit
Dmitry Goldenberg wrote:
> Awesome stuff. A few questions: is your Excel extractor somehow
> better than POI's? and, what do you see as the timeframe for adding
> WordPerfect support? Are you considering supporting any other sources
> such as MS Project, Framemaker, etc?

I just committed a WordPerfectExtractor ;)

It's based on code developed in-house at Aduna and it seems to work
quite well on my test collection of WordPerfect documents. Words are
occasionally split in the middle; I'm still looking into that.

The test set is biased towards older WordPerfect documents though. I'm
trying to get my hands on a recent copy of WordPerfect to see if the
latest format is also supported and to create unit tests for it.

To interactively test the extractor(s) yourselves:

- check out Aperture from CVS (see
http://sourceforge.net/cvs/?group_id=150969)
- run "ant release"
- go to build\release\bin and execute fileinspector.bat
- drag any file (WordPerfect or any other format) into the window to see
what MIME type Aperture thinks it is and to execute the corresponding
Extractor, if available. The two tabs show the extracted full-text and
an RDF dump of the metadata. For WordPerfect, only full-text extraction
is currently supported.

Our ExcelExtractor is basically nothing more than glue code between POI
and the rest of our framework, meaning that an application using the
framework can request an Extractor implementation for
"application/vnd.ms-excel", feed it an InputStream and get the text and
metadata back.

The only advantage of our ExcelExtractor over direct use of POI is that,
when POI throws an Exception on a particular document, it reverts to a
heuristic string extraction algorithm which is often able to extract
full-text from a document with reasonable quality, i.e. suited for indexing.
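
To make the "glue code plus fallback" idea concrete, here is a rough
sketch of a registry that hands out extractors by MIME type and wraps
each one so that a failure in the format-specific parser falls through to
a heuristic extractor. The interface and class names are hypothetical and
deliberately simplified (a byte array instead of an InputStream, text
only, no metadata); Aperture's real API may look quite different.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical interface; not Aperture's actual Extractor API. */
interface TextExtractor {
    String extract(byte[] document) throws IOException;
}

/** Hypothetical registry mapping MIME types to extractors. */
final class ExtractorRegistry {
    private final Map<String, TextExtractor> byMimeType = new HashMap<>();
    private final TextExtractor fallback;

    /** The fallback stands in for a heuristic string extractor. */
    ExtractorRegistry(TextExtractor fallback) {
        this.fallback = fallback;
    }

    void register(String mimeType, TextExtractor extractor) {
        byMimeType.put(mimeType, extractor);
    }

    /**
     * Returns an extractor that tries the format-specific parser first and
     * uses the fallback if that parser throws, or null if the MIME type
     * is unknown.
     */
    TextExtractor get(String mimeType) {
        TextExtractor primary = byMimeType.get(mimeType);
        if (primary == null) {
            return null;
        }
        return document -> {
            try {
                return primary.extract(document);
            } catch (IOException | RuntimeException e) {
                // parser gave up on this document: settle for heuristic text
                return fallback.extract(document);
            }
        };
    }
}
```

An application would then ask the registry for
"application/vnd.ms-excel" and never needs to know whether the text came
from the real parser or from the fallback.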

We are surely considering supporting more formats. Which ones we will
work on depends on a number of factors, e.g. availability of open source
libs for that format, complexity of the file format (we did WordPerfect
by ourselves), customer demand, code contributions from others, etc. In
any case, if you need support for format XYZ, you can always send me
some example files and I'll take a look at how hard it is to add support
for it.


Chris
--


Re: Word files & Build vs. Buy?

Nick Burch
In reply to this post by Christiaan Fluit
On Thu, 9 Feb 2006, Christiaan Fluit wrote:
> Yes, that's exactly what I'm doing. Having this in POI would benefit me
> a lot though, as I hardly understand the POI basics to be honest (my
> fault, not POI's).

OK, that's now in POI (you'll need a scratchpad build from late yesterday
or today; see http://encore.torchbox.com/poi-cvs-build/ for jars).

The code is in org.apache.poi.hwpf.extractor.WordExtractor, and it
supports grabbing all the text, or grabbing an array of the text in each
paragraph.

If you have any problems/queries/comments on it, then you'll probably get
a better response on poi-user than here!

Nick
