Looking to Index Various Document Types.

DURGA DEEP
Hi folks,

I was looking at the Lucene FAQ and found this entry very interesting:
How can I index OpenOffice.org files?

These files (.sxw, .sxc, etc) are ZIP archives that contain XML files.
Uncompress the file using Java's ZIP support, then parse meta.xml to get
title etc. and content.xml to get the document's content. Add these to the
Lucene index, typically using one Lucene field per property.

Note that this applies to OpenOffice.org 1.x; things have changed a bit for
OpenOffice.org 2.x, but the basic approach is still the same.
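
For illustration, the steps in that FAQ entry could be sketched in Java roughly as below. This is a minimal sketch, not the FAQ's own code: the class name `OpenOfficeExtractor`, the method names, and the field names `title` and `content` are made up for the example. It assumes the title is stored in a Dublin Core `dc:title` element in meta.xml and that body text sits in elements whose local name is `p` or `h` in content.xml. It also assumes Java 9+ (for `readAllBytes`) and skips the DTD that OpenOffice.org 1.x files typically reference, so the parser does not try to load it.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: pulls a title and body text out of an OpenOffice.org ZIP archive.
final class OpenOfficeExtractor {

    // Assumption: Dublin Core namespace used for dc:title in meta.xml.
    private static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns a map of field name -> text, one entry per extracted property. */
    static Map<String, String> extractFields(InputStream in) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("meta.xml")) {
                    Document doc = parse(zip.readAllBytes());
                    NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
                    if (titles.getLength() > 0) {
                        fields.put("title", titles.item(0).getTextContent().trim());
                    }
                } else if (entry.getName().equals("content.xml")) {
                    StringBuilder text = new StringBuilder();
                    collectParagraphs(parse(zip.readAllBytes()), text);
                    fields.put("content", text.toString().trim());
                }
            }
        }
        return fields;
    }

    // Parses XML from memory; external DTD references resolve to an empty source.
    private static Document parse(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        return builder.parse(new ByteArrayInputStream(xml));
    }

    // Appends the text of each paragraph or heading element, one per line, in document order.
    private static void collectParagraphs(Node node, StringBuilder out) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            String local = child.getLocalName();
            if ("p".equals(local) || "h".equals(local)) {
                out.append(child.getTextContent().trim()).append('\n');
            } else {
                collectParagraphs(child, out);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java OpenOfficeExtractor.java some-document.sxw
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            extractFields(in).forEach((name, value) -> System.out.println(name + ": " + value));
        }
    }
}
```

Each entry in the returned map would then become one field on a Lucene document, as the FAQ suggests; the exact Lucene calls depend on the Lucene version in use, so they are left out here.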

You can also use the LIUS framework for indexing
OpenOffice <http://wiki.apache.org/lucene-java/OpenOffice> documents
(http://www.bibl.ulaval.ca/lius/).
LIUS allows metadata and full-text indexing, using XPath.

But the problem is that I was not able to find more information at
http://www.bibl.ulaval.ca/lius/
Has anyone had better luck finding more information on using LIUS?
Also, please suggest any alternatives if LIUS is no longer available.
We have PDF, MS Office, and other documents in the pipeline
that need to be indexed.

Thanks Much
-DD

RE: Looking to Index Various Document Types.

steve_rowe
'sup, DD:

You should have posted your question, which is about *using* Lucene, to the java-user mailing list; the java-dev mailing list is instead intended for discussion of *development of* Lucene.

Here's a LIUS tutorial, in both French and English:

http://www.doculibre.com/lius/

And here's a discussion of using Solr to index OpenOffice.org docs - basically done by unzipping and XSLT to create per-field data:

<http://wiki.apache.org/cocoon-data/attachments/GT2006Notes/attachments/13-SubversionSolr.pdf>
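
To give a rough idea of the "unzip, then XSLT" approach, here is a minimal Java sketch using the JDK's built-in `javax.xml.transform` API. It is not taken from those slides: the class name `MetaToSolrXml`, the stylesheet, and the Solr field names `id` and `title` are all assumptions for illustration, and the real field names depend on your Solr schema. It takes the already-unzipped meta.xml as a string and emits Solr's `<add><doc>` update XML. Real OO.o 1.x files typically carry a DOCTYPE pointing at a DTD, which would need an entity resolver or similar before this would run against them.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: turns meta.xml into a Solr <add> message with one <field> per property.
final class MetaToSolrXml {

    // Assumed stylesheet: maps dc:title to a "title" field and takes the document id as a parameter.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
        + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\""
        + " exclude-result-prefixes=\"dc\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:param name=\"id\"/>"
        + "<xsl:template match=\"/\">"
        + "<add><doc>"
        + "<field name=\"id\"><xsl:value-of select=\"$id\"/></field>"
        + "<field name=\"title\"><xsl:value-of select=\"//dc:title\"/></field>"
        + "</doc></add>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet to meta.xml content and returns the resulting Solr update XML. */
    static String toSolrAddXml(String id, String metaXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        transformer.setParameter("id", id);
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(metaXml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String metaXml =
            "<office:document-meta xmlns:office=\"http://openoffice.org/2000/office\""
            + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
            + "<office:meta><dc:title>Quarterly report</dc:title></office:meta>"
            + "</office:document-meta>";
        System.out.println(toSolrAddXml("doc-1", metaXml));
    }
}
```

The resulting XML would then be POSTed to Solr's update handler; a second stylesheet over content.xml could add a body-text field in the same way.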

Aperture <http://aperture.sf.net> claims to support OO.o 1.X files - you might look there if you don't want to roll your own solution.

Steve

On 03/12/2008 at 3:12 PM, DURGA DEEP wrote:
> Hi folks,
>
> I was looking at the Lucene FAQ and I found this very interesting.
> How can I index OpenOffice.org files?
[...]
> But the problem is that I was not able to find more information at
> http://www.bibl.ulaval.ca/lius/ Has anyone had better luck finding
> more information on using LIUS? Also, please suggest any alternatives
> if LIUS is no longer available. We have PDF, MS Office, and other
> documents in the pipeline that need to be indexed.
>
> Thanks Much
> -DD
>

 

