Indexing documents remotely


Hi all,

How can I index documents (PDF, HTML, XML) remotely?

I am using Lucene 2.0.0.

Currently I use the following commands.

To parse HTML:

java org.w3c.tidy.Tidy -m *.html

To index HTML:

java org.apache.lucene.demo.IndexHTML -create -index index .\

To index PDF:

java org.pdfbox.searchengine.lucene.IndexFiles -create -index C:\tomcat\webapps\luceneweb\index .\

But how can I parse XML?

At the moment I run

java dom.DOMFilter *.xml

but how can I then index the XML?
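
To make the XML question concrete, here is a minimal sketch of the kind of step that seems to be missing. It is only an assumption about how this could be done, not a known Lucene recipe: it uses the JDK's built-in DOM parser to pull the text out of an XML document so that the text can be handed to an indexer. The class name XmlTextExtractor and the method extractText are hypothetical names made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of Lucene): extracts the plain text of an XML document.
public class XmlTextExtractor {

    // Parse the XML string and return all text nodes joined by single spaces.
    public static String extractText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        collect(doc.getDocumentElement(), out);
        return out.toString();
    }

    // Depth-first walk: append every non-blank text or CDATA node to the buffer.
    private static void collect(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            String text = node.getNodeValue().trim();
            if (text.length() > 0) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    public static void main(String[] args) throws Exception {
        // Sample input invented for this illustration.
        String xml = "<doc><title>Sample title</title><body>Sample body text</body></doc>";
        System.out.println(extractText(xml)); // prints "Sample title Sample body text"
    }
}
```

Presumably the extracted string could then be added to a Lucene Document as a tokenized field, for example with new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED), and written with an IndexWriter in the same way the HTML demo does. The field name "contents" is an assumption here.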