Using Lucene to index Meta-data from txt, html, PDF etc files.


Aditya Gollakota
Hi guys,

I'm wondering how you would go about indexing metadata from files. I started
from the demo package's IndexHTML.java and updated HTMLDocument.java with the
following:

DataInput input = new DataInputStream(new BufferedInputStream(new FileInputStream(f)));
Content content = Content.read(input);
Reader contentReader = new ArrayFile.Reader(new LocalFileSystem(null), new File(f.getPath(), Content.DIR_NAME).toString(), null);

System.out.println(content);

ParseData parseData = ParseData.read(input);
Metadata metadata = parseData.getContentMeta();

doc.add(new Field("keywords", metadata.KEYWORDS, Field.Store.YES, Field.Index.NO));

I'm using nutch-0.8.jar for the Metadata class, have added the other Nutch
jars to resolve the exceptions that came up, and am on Lucene 2.0.0.

While compiling this code, I get the following error:

A record version mismatch occurred. Expecting v1, found v118.
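
For what it's worth, 118 is the byte value of the ASCII character 'v', so one possible reading of the error is that the first byte of an ordinary file is being treated as a record version tag. The sketch below only illustrates that guess with plain JDK classes. The class name, the expected-version constant, and the sample input are all assumptions made for illustration, and this is not the actual Nutch or Hadoop code.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not taken from Nutch or Hadoop.
class VersionByteSketch {
    // Assumed from the "Expecting v1" part of the error message.
    static final byte EXPECTED_VERSION = 1;

    // Reads the first byte as a version tag and, on a mismatch, returns a
    // message worded like the error quoted above.
    static String check(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            byte found = in.readByte();
            if (found != EXPECTED_VERSION) {
                return "A record version mismatch occurred. Expecting v"
                        + EXPECTED_VERSION + ", found v" + found + ".";
            }
            return "version ok";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample plain-text input whose first character is 'v' (byte 118).
        byte[] plainText = "various keywords".getBytes(StandardCharsets.US_ASCII);
        System.out.println(check(plainText));
    }
}
```

If that reading is right, the input being passed to the record readers may simply not be in the serialized format they expect, but I have not been able to confirm this.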

Any help would be much appreciated.

Regards,

Aditya Gollakota
Support Engineer | CustomWare Asia Pacific | www.customware.net
T: +61 2 9900 5742 | F: +61 2 9475 0100 | M: +61 405 033 951