Lucene - FileFormat


Lucene - FileFormat

Fisheye
I'm trying to build a plain-text parser for different file formats such as MS Word, Excel, PowerPoint, Rich Text Format, plain text, HTML, PDF, etc.

I use the well-known libraries PDFBox and POI, plus some parts of AtLeap. Now I also need to support the OpenOffice formats and, more importantly, the MSG format (the MS Outlook message format).

Does anyone know how I can extract plain text from MSG files as simply as POI does for Office files? Is there perhaps an open source library for this, like the ones for PDF or MS Office files?

I need the plain text because the only workable approach I see is to extract all the plain text from every single document and then add it to my Lucene index. This is necessary to get the best excerpt from the highlighter.

Thx

Simon Dietschi
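
The approach described above (one plain-text extractor per file format, with the extracted text then handed to the indexer) could be organised roughly as follows. This is only a sketch under stated assumptions: the class PlainTextExtractors and the interface TextExtractor are hypothetical names invented for illustration, only .txt and a very crude .html handler are implemented, and the PDFBox, POI, or MSG extractors would have to be plugged in separately. No Lucene calls are shown.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class PlainTextExtractors {

    /** Hypothetical interface: turns one file into plain text. */
    public interface TextExtractor {
        String extract(Path file) throws IOException;
    }

    // Extractors keyed by lower-case file extension (without the dot).
    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    public PlainTextExtractors() {
        // Plain text needs no conversion (UTF-8 is assumed here).
        register("txt", file ->
                new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        // Crude HTML handling, for illustration only: drop tags, collapse whitespace.
        register("html", file ->
                stripTags(new String(Files.readAllBytes(file), StandardCharsets.UTF_8)));
        // Extractors backed by PDFBox, POI, or an MSG parser would be registered the same way.
    }

    public void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Picks an extractor by file extension and returns the plain text. */
    public String extract(Path file) throws IOException {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(ext);
        if (extractor == null) {
            throw new IOException("No extractor registered for: " + name);
        }
        return extractor.extract(file);
    }

    /** Replaces every tag with a space, then collapses runs of whitespace. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sample", ".html");
        Files.write(tmp, "<html><body><p>Hello <b>Lucene</b></p></body></html>"
                .getBytes(StandardCharsets.UTF_8));
        // The returned string is what would be added to the index as the document text.
        System.out.println(new PlainTextExtractors().extract(tmp));
        Files.delete(tmp);
    }
}
```

Keeping the extracted text as one string per document is what makes it usable for highlighting later, since the excerpt has to be cut from the same text that was indexed.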

Lucene, TREC, and WT10G

Trung-2
Hi all,

Has anyone used Lucene to index WT10G? Can it index
WT10G in compressed format (.gz), or do we have to unzip
it first?

Furthermore, does Lucene support the TREC format? I mean,
can it take a topic file like "<TOP> <NUM> 1
<TITLE>  abc def </TOP>" and produce a results file
that we can use with the trec_eval program?

Any help will be appreciated,
Thanh

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: Lucene - FileFormat

Dmitry Goldenberg
In reply to this post by Fisheye
Simon,
 
I wonder if using Zoe might do the trick - http://guests.evectors.it/zoe/
Have you tried it?
 
- Dmitry

________________________________

From: Fisheye [mailto:[hidden email]]
Sent: Fri 4/21/2006 7:23 AM
To: [hidden email]
Subject: Lucene - FileFormat




--
View this message in context: http://www.nabble.com/Lucene---FileFormat-t1485959.html#a4024568
Sent from the Lucene - Java Users forum at Nabble.com.



Re: Lucene, TREC, and WT10G

trupti mulajkar
In reply to this post by Trung-2
Lucene can index the TREC documents, but it depends on how you want to index them.
If you want to index the sub-files (the individual documents) in the TREC data, you have to modify
IndexFiles.java to read the tags; otherwise you can index the files normally.
 
cheers,
trupti mulajkar
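
As a rough illustration of "reading the tags", the sketch below reads a gzip-compressed file through java.util.zip.GZIPInputStream (so nothing has to be unzipped on disk first) and splits it into individual records. It assumes each record is wrapped in <DOC> ... </DOC> with <DOC>, </DOC> and <DOCNO>...</DOCNO> each on a line of their own, and that ISO-8859-1 is an acceptable decoding; both are assumptions to check against the actual collection. TrecGzipReader and TrecDoc are hypothetical names, and this is not the code of IndexFiles.java.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TrecGzipReader {

    /** One <DOC> record: its <DOCNO> value and the remaining lines of the record. */
    public static final class TrecDoc {
        public final String docNo;
        public final String body;

        TrecDoc(String docNo, String body) {
            this.docNo = docNo;
            this.body = body;
        }
    }

    /** Decompresses the stream on the fly and splits it into <DOC> records. */
    public static List<TrecDoc> read(InputStream gzipped) throws IOException {
        List<TrecDoc> docs = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.ISO_8859_1))) {
            StringBuilder body = null; // non-null only while inside a <DOC> record
            String docNo = null;
            String line;
            while ((line = in.readLine()) != null) {
                String t = line.trim();
                if (t.equals("<DOC>")) {
                    body = new StringBuilder();
                    docNo = null;
                } else if (t.equals("</DOC>")) {
                    if (body != null) {
                        docs.add(new TrecDoc(docNo, body.toString()));
                    }
                    body = null;
                } else if (body != null) {
                    if (docNo == null && t.startsWith("<DOCNO>") && t.endsWith("</DOCNO>")) {
                        docNo = t.substring("<DOCNO>".length(),
                                t.length() - "</DOCNO>".length()).trim();
                    } else {
                        body.append(line).append('\n');
                    }
                }
            }
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny gzip file in memory; the document numbers are made up.
        String raw = "<DOC>\n<DOCNO>D1</DOCNO>\nfirst text\n</DOC>\n"
                + "<DOC>\n<DOCNO>D2</DOCNO>\nsecond text\n</DOC>\n";
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(raw.getBytes(StandardCharsets.ISO_8859_1));
        }
        for (TrecDoc d : read(new ByteArrayInputStream(buf.toByteArray()))) {
            // Each record would become one document in the index.
            System.out.println(d.docNo + " -> " + d.body.trim());
        }
    }
}
```

For a real collection file, the same read method could be given a java.io.FileInputStream opened on the .gz file.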



Re: Lucene, TREC, and WT10G

Trung-2
Hi trupti,

Thanks for your response. I have another question:
can Lucene take a topic file like "<TOP>
<NUM> 1 <TITLE>  abc def </TOP>" and produce a
result_file that we can use with the trec_eval program
(trec_eval relevant_file result_file, where relevant_file
is the TREC relevance judgment file for these topics)?

Thank you in advance,
Thanh.




Re: Lucene, TREC, and WT10G

Grant Ingersoll
It is up to you to write a program to do this, but it is relatively
easy.  You may want to search the web; chances are someone has posted
code to do this, as a number of people have used Lucene in TREC in the past.

Good luck,
Grant
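
A minimal sketch of such a program's two ends: parsing topics and writing a results file in the six-column layout that trec_eval reads (topic id, the literal Q0, document number, rank, score, run tag). The topic pattern only matches the simplified "<TOP> <NUM> 1 <TITLE> abc def </TOP>" form quoted in this thread; real TREC topic files carry more fields and would need a different pattern. TrecRunWriter, Hit, and the sample document numbers are hypothetical, and the search step in between (running each title as a Lucene query) is deliberately left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrecRunWriter {

    // Matches only the simplified topic layout quoted in the thread.
    private static final Pattern TOPIC = Pattern.compile(
            "<TOP>\\s*<NUM>\\s*(\\S+)\\s*<TITLE>\\s*(.*?)\\s*</TOP>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** One ranked result: document number plus retrieval score. */
    public static final class Hit {
        public final String docNo;
        public final double score;

        public Hit(String docNo, double score) {
            this.docNo = docNo;
            this.score = score;
        }
    }

    /** Returns topic id -> title text, in file order. */
    public static Map<String, String> parseTopics(String text) {
        Map<String, String> topics = new LinkedHashMap<>();
        Matcher m = TOPIC.matcher(text);
        while (m.find()) {
            topics.put(m.group(1), m.group(2));
        }
        return topics;
    }

    /** Formats hits (already sorted best first) as results-file lines. */
    public static String toRunLines(String topicId, List<Hit> hits, String runTag) {
        StringBuilder sb = new StringBuilder();
        int rank = 1; // ranks are numbered from 1 here
        for (Hit h : hits) {
            sb.append(String.format(Locale.ROOT, "%s Q0 %s %d %.4f %s\n",
                    topicId, h.docNo, rank++, h.score, runTag));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> topics =
                parseTopics("<TOP> <NUM> 1 <TITLE>  abc def </TOP>");
        for (Map.Entry<String, String> t : topics.entrySet()) {
            // Placeholder hits; a real run would search the index with t.getValue().
            List<Hit> hits = Arrays.asList(
                    new Hit("DOC-A", 2.5), new Hit("DOC-B", 1.25));
            System.out.print(toRunLines(t.getKey(), hits, "lucene-run"));
        }
    }
}
```

The output of toRunLines for all topics, concatenated into one file, plays the role of result_file in the trec_eval command mentioned in the thread.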

thanh nguyen wrote:

> Hi trupti,
>
> Thanks for your response. I have another question:
> can Lucene take a topic file like "<TOP>
> <NUM> 1 <TITLE>  abc def </TOP>" and produce a
> result_file that we can use with the trec_eval program
> (trec_eval relevant_file result_file, where relevant_file
> is the TREC relevance judgment file for these topics)?
>
> Thank you in advance,
> Thanh.
>
>

--

Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244

http://www.cnlp.org 
Voice:  315-443-5484
Fax: 315-443-6886

