Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

Lukáš Vlček
Hi,

I need to find a reliable way to extract content from Word, Excel and
PowerPoint formats prior to indexing, and I am not sure if POI is the best
way to go. Can anybody share experience with POI and/or other [commercial]
Java libraries for text extraction from MS formats?

My experience with POI is that it can sometimes be a pain to get the
content out of MS files properly. I also know that the Nutch plugin uses POI
for MS formats, but as far as I remember it is not 100% reliable (my
experience, now more than a year old, is that about 1-2% of files were not
parsed).

My requirements are that the text extraction software must run on Linux and
should be written in Java; it can be an open source or a commercial library.

Regards,
Lukas

--
http://blog.lukas-vlcek.com/

Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

Nick Burch
On Mon, 12 May 2008, Lukas Vlcek wrote:
> I need to find a reliable way how to extract content out of Word, Excel
> and PowerPoint formats prior to indexing and I am not sure if POI is the
> best way to go. Can anybody share experience with POI and/or other
> [commercial] Java library for text extraction from MS formats?

We use POI for text extraction, and it works just fine for us. POI 3.1
should offer a few improvements on text extraction, and POI 3.5 will give
you OOXML text extraction too.

You might also like to take a look at Apache Tika
<http://incubator.apache.org/tika/>. It wraps up POI (and a few other
document extractor libraries), giving you a simple, common interface for
text extraction.

Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

Karl Wettin
In reply to this post by Lukáš Vlček
Lukas Vlcek wrote:
> Hi,
>
> I need to find a reliable way how to extract content out of Word, Excel and
> PowerPoint formats prior to indexing and I am not sure if POI is the best
> way to go. Can anybody share experience with POI and/or other [commercial]
> Java library for text extraction from MS formats?

I like Antiword for .doc files.

http://www.winfield.demon.nl/


        karl


Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

adb
In reply to this post by Lukáš Vlček
We are using POI 3.0.2 FINAL. Like you, we find it is not very reliable for
many Word files. It does not support Word 2, fast-saved files, or files which
are not padded to 256 bytes. PPT and Excel are quite bad; a large percentage of
our PPT files throw exceptions. I have not tried 3.1 as it has just gone to
BETA 1, but I expect that the Word parsing is unchanged, since the changelog
doesn't show any Word changes.

TextMining.org http://www.textmining.org/ is quite good, but the 0.4 version did
not do Word 2 or fast-saved files. The 1.0 version should fix that, but I've not
yet tried it. The licence for 1.0 is LGPL, whereas 0.4 was Apache 2.

AbiWord http://www.abisource.com/ is pretty good, but it's a complete GUI so it
is quite slow if you want to use it for a lot of parsing. It can do text
extraction via the command line. The Linux version supports pipes. It's
based on WvWare http://wvware.sourceforge.net/

Catdoc (http://ftp.wagner.pp.ru/~vitus/software/catdoc/) is quite effective
and fast. It also has catppt. I'm not sure if the text order is 100% according
to the original though.

The last two are not licence friendly for distribution.
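For anyone wanting to call a command-line extractor such as catdoc from Java, here is a minimal sketch. The class name ExternalExtractor, the timeout handling and the UTF-8 decoding are assumptions for illustration, not something taken from Nutch or from the tools themselves. Output goes to a temporary file so that a hung process can be killed after the timeout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that runs an external text extractor and returns its stdout. */
final class ExternalExtractor {

    /**
     * Runs the given command, waits up to timeoutSeconds, and returns what the
     * process wrote to standard output. Throws IOException on timeout or on a
     * non-zero exit code.
     */
    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Capture stdout in a temp file so a slow or hung process cannot block us.
        Path output = Files.createTempFile("extract", ".txt");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectOutput(output.toFile())
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("Timed out: " + command);
            }
            if (process.exitValue() != 0) {
                throw new IOException("Exit code " + process.exitValue() + ": " + command);
            }
            // Assumption: the tool emits UTF-8; adjust to match the tool's configuration.
            return new String(Files.readAllBytes(output), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(output);
        }
    }
}
```

Assuming catdoc is installed and on the PATH, a call would look like ExternalExtractor.run(Arrays.asList("catdoc", "/path/to/file.doc"), 30).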

I've extracted the Nutch parsing framework and am using it in our product. I
have tested all of the above, and my priority order for Word parsing is
TextMining v0.4, then POI, and then the other two, which I plugged in via the
parse-ext parser.
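The priority ordering described above can be sketched as a simple fallback chain. This is only an illustration: FallbackExtractor and its TextExtractor interface are hypothetical names, not part of Nutch, POI or TextMining, and each real library would need a small adapter behind that interface.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Tries several extractors in priority order and returns the first usable result. */
final class FallbackExtractor {

    /** Hypothetical adapter interface; wrap each real library behind it. */
    @FunctionalInterface
    interface TextExtractor {
        String extract(byte[] document) throws Exception;
    }

    /** The extracted text together with the name of the extractor that produced it. */
    static final class Result {
        final String extractorName;
        final String text;

        Result(String extractorName, String text) {
            this.extractorName = extractorName;
            this.text = text;
        }
    }

    // LinkedHashMap keeps insertion order, which is the priority order here.
    private final Map<String, TextExtractor> chain = new LinkedHashMap<>();

    /** Appends an extractor to the end of the chain (lowest priority so far). */
    FallbackExtractor add(String name, TextExtractor extractor) {
        chain.put(name, extractor);
        return this;
    }

    /** Returns the first non-empty result, or null if every extractor failed. */
    Result extract(byte[] document) {
        for (Map.Entry<String, TextExtractor> entry : chain.entrySet()) {
            try {
                String text = entry.getValue().extract(document);
                if (text != null && !text.trim().isEmpty()) {
                    return new Result(entry.getKey(), text);
                }
            } catch (Exception e) {
                // This extractor could not handle the file; fall through to the next one.
            }
        }
        return null;
    }
}
```

Recording which extractor succeeded makes it easy to see later how often the lower-priority parsers are actually needed.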

HTH
Antony

Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

mark harwood
On the commercial front, Oracle's "Outside In" (previously Stellent) is the one that gets used in a lot of search engines.

Being a C-based product though, integration isn't quite as nice/easy as pure Java solutions.

Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

Robert.Hastings
We are using Aspose: www.aspose.com. We are still in pre-release, but it
works fine for all of the MS products. It's commercial, but is a good
deal as long as you don't have too many developers working on it, since
the licensing is per seat. We had a little trouble with their PDF
product. The other thing is that their main product line is .NET, but the
Java line has kept up pretty well. For text extraction the APIs are
straightforward.






Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

Grant Ingersoll-2
In reply to this post by Lukáš Vlček
I've used POI, as well as commercial providers. As always, it
depends :-) I wasn't particularly impressed with the commercial
providers given the amount of money they wanted for them. PDF was
particularly tricky, but you weren't asking about that. At least w/
POI, you have the opportunity to fix things that don't work based on
your priorities. I don't know what the failure rate is for the
commercial providers, but my experience is they will all fail at least
once, so you had better plan on it. I'd look to use a framework like Tika
or Aperture, where you can easily upgrade or plug in new or different
libraries (including commercial providers) as needed w/o rewriting
your code. Additionally, with something like Tika or Aperture, you
could easily mix and match your solutions, such that you use one for
Word and a different one for PPT or PDF.
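To make the mix-and-match idea concrete, here is a rough sketch of a per-format registry that also treats failure as an expected outcome. It does not show how Tika or Aperture actually work; ExtractorRegistry and TextExtractor are hypothetical names, and dispatching on file extension is a simplifying assumption (a real framework would more likely detect the content type).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Maps a file extension to whichever extractor is currently preferred for it. */
final class ExtractorRegistry {

    /** Hypothetical adapter interface; open source or commercial libraries plug in here. */
    @FunctionalInterface
    interface TextExtractor {
        String extract(byte[] document) throws Exception;
    }

    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    /** Registers (or replaces) the extractor used for one extension, e.g. "doc". */
    void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /**
     * Returns the extracted text, or an empty Optional when no extractor is
     * registered for the extension or the extractor throws. Callers are expected
     * to handle the empty case, since some files will always fail.
     */
    Optional<String> extract(String fileName, byte[] document) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(extension);
        if (extractor == null) {
            return Optional.empty();
        }
        try {
            return Optional.ofNullable(extractor.extract(document));
        } catch (Exception e) {
            return Optional.empty();
        }
    }
}
```

Swapping the library used for one format is then a single register call, and the indexing code that consumes the Optional does not change.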

One issue with any of them is how you plan to use them. If you need
more than a bag of words, they all get less reliable, especially when it
comes to PDFs and Office docs. Dealing with things like tables,
columns, captions, labels, etc. has always been problematic in my
experience when one wants to do higher-level processing (beyond
keyword search).

HTH,
Grant

--------------------------
Grant Ingersoll
http://lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

Andrzej Białecki-2
Yet another option ... In the past I used a licensed copy of MS Office
to extract things that I wanted, using a bit of OLE automation and
VBScript. It worked reasonably well, in the sense that I had no issues
whatsoever with extracting the content _and_ formatting from any
documents that could be normally opened with MS Office. However,
performance was an issue, i.e. it was slow, a CPU/memory hog, and
occasionally it would get stuck in a weird state where only a complete
reboot would help.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

Lukáš Vlček
Does it make sense to consider using OpenOffice to convert from MS formats
to PDF or HTML before indexing? Would this yield a lower failure rate
than the pure POI approach? I don't care about formatting for now; I care
about content in the first place. Formatting would be important only if
Nutch or another piece of software were able to carry this information
into the Lucene index (so that words in a headline would get a higher
boost, for example).
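One way to answer the failure-rate question empirically is to run each candidate over the same sample of real documents and count failures. The sketch below assumes a hypothetical TextExtractor adapter around each approach (POI, an OpenOffice conversion step, and so on); it counts both exceptions and empty output as failures, which is itself an assumption about what "not parsed" should mean.

```java
import java.util.List;

/** Measures how often an extractor fails on a sample corpus. */
final class FailureRate {

    /** Hypothetical adapter interface around whichever extraction approach is tested. */
    @FunctionalInterface
    interface TextExtractor {
        String extract(byte[] document) throws Exception;
    }

    /**
     * Returns the fraction of documents (0.0 to 1.0) for which the extractor
     * threw an exception or produced no text. An empty corpus yields 0.0.
     */
    static double measure(TextExtractor extractor, List<byte[]> corpus) {
        if (corpus.isEmpty()) {
            return 0.0;
        }
        int failures = 0;
        for (byte[] document : corpus) {
            try {
                String text = extractor.extract(document);
                if (text == null || text.trim().isEmpty()) {
                    failures++;
                }
            } catch (Exception e) {
                failures++;
            }
        }
        return (double) failures / corpus.size();
    }
}
```

Running measure once per candidate on the same corpus gives directly comparable numbers, such as the 1-2% figure mentioned at the start of the thread.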

A couple of words about my motivation:

We deployed SharePoint 2007 in our company. We are not very satisfied with
its search capabilities, so I started looking for alternatives. The
first thing I looked at was the Google Search Appliance, as they claim it can
crawl, index and search SharePoint portals.

I realized that its integration with the outside world is done via a connector
manager, which is itself an open source project written in Java, and the
SharePoint connector implementation is also released as open source in
Java. This makes me think that I should be able to test the SharePoint
connector by replacing the GSA black box with Lucene, Nutch, Solr or whatever,
and see how well this connector thing works. This would be a perfect test
before we invest more in GSA. On the other hand, if I am able to run the
SharePoint connector without GSA (replaced by a Lucene-based product), then
text extraction from MS family formats could be the main barrier.

Lukas


--
http://blog.lukas-vlcek.com/

Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?

Jay O'Leary
If it's Windows only, you can roll your own with IFilters
(http://www.ifilter.org/).
