indexing TREC

indexing TREC

xx28
Hi -

Is there a TREC parser available for Lucene (for indexing many documents stored in the same file)?

Thanks,

George


Re: indexing TREC

Erik Hatcher

On May 7, 2005, at 10:24 PM, [hidden email] wrote:

> Hi -
>
> Is there a TREC parser available for Lucene (for indexing many
> documents stored in the same file)?

Not to my knowledge.  I've been working on indexing some TREC data
myself, using some code (not written by me) that does a simplistic
grab of a couple of fields from the SGML.

If there is something handy for parsing these files, though, I'd love
to give it a spin.
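
The grabbing really is crude -- roughly the following, assuming the
usual upper-case TREC tags (<DOC>, <DOCNO>, <TEXT>) and the Lucene
1.4 Field API.  The class and field names here are just illustrative:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Crude extraction of a couple of fields from one TREC SGML record.
// No real SGML parsing -- just indexOf on the literal tags.
public class TrecSgml {

    // Returns the text between <TAG> and </TAG>, or "" if absent.
    static String extract(String record, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        int start = record.indexOf(open);
        int end = record.indexOf(close);
        if (start < 0 || end < 0 || end < start) return "";
        return record.substring(start + open.length(), end).trim();
    }

    // Builds a Lucene Document from the two fields we grab.
    // The field names "docno" and "text" are arbitrary choices.
    public static Document toDocument(String record) {
        Document doc = new Document();
        doc.add(Field.Keyword("docno", extract(record, "DOCNO")));
        doc.add(Field.Text("text", extract(record, "TEXT")));
        return doc;
    }
}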

     Erik



Re: indexing TREC

Ian Soboroff
Quoting Erik Hatcher <[hidden email]>:

>
> On May 7, 2005, at 10:24 PM, [hidden email] wrote:
>
> > Hi -
> >
> > Is there a TREC parser available for Lucene (for indexing many
> > documents stored in the same file)?
>
> Not to my knowledge.  I've been working on indexing some TREC data
> myself, using some code (not written by me) that does a simplistic
> grab of a couple of fields from the SGML.
>
> If there is something handy for parsing these files, though, I'd love
> to give it a spin.

We at NIST have code to do this, integrated with Lucene.  We have a simple
SGML/XML parser and a simple reader which walks the documents... that's most
of what you need.  You could probably hack it together with NekoHTML in a
day or two, actually.  It can index the TREC newswire corpora in a couple of
hours on a pretty stock Dell workstation, and the TREC "Terabyte" collection
(actually 480GB, 25 million web pages) in a weekend on a handful of
workstations indexing in parallel.
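
The reader side is on the order of the following sketch -- not our
actual code, and it just indexes each whole <DOC> record as one field
where our parser pulls out individual fields.  It assumes the Lucene
1.4 API; the index path and field name are arbitrary:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Walks TREC bundle files (many <DOC>...</DOC> records back to back,
// given as command-line arguments) and hands each record to Lucene.
// Sketch only; no error recovery.
public class TrecReader {
    public static void main(String[] args) throws IOException {
        IndexWriter writer =
            new IndexWriter("trec-index", new StandardAnalyzer(), true);
        for (int i = 0; i < args.length; i++) {
            BufferedReader in = new BufferedReader(new FileReader(args[i]));
            StringBuffer record = null;   // non-null while inside a <DOC>
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("<DOC>")) record = new StringBuffer();
                if (record != null) record.append(line).append('\n');
                if (record != null && line.startsWith("</DOC>")) {
                    Document doc = new Document();
                    // Crudest approach: one tokenized field for the whole
                    // record.  Real code pulls out DOCNO and friends, along
                    // the lines Erik describes.
                    doc.add(Field.Text("contents", record.toString()));
                    writer.addDocument(doc);
                    record = null;
                }
            }
            in.close();
        }
        writer.optimize();
        writer.close();
    }
}

Running one of these per workstation over a slice of the file list and
merging the per-machine indexes afterward (IndexWriter has an addIndexes
method for that) is roughly the shape of the parallel setup.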

It's not quite ready for prime-time release yet, though.  I hope to have some
time this summer to clean it up enough that we're not embarrassed to give it out
;-).  If your need is desperate, drop me an email and I'll see what I can do.

Ian


