delete unnecessary files after optimize()


delete unnecessary files after optimize()

Koji Sekiguchi-4
Hello,

My Tomcat application has several threads. These threads
share a single instance of IndexSearcher to search contents.

At some point in time, I have the following index directory:

-rwx------+ 1 admin admin 158622 Oct 16 10:21 _1pp.cfs
-rwx------+ 1 admin admin 158622 Oct 16 10:21 _2kk.cfs
-rwx------+ 1 admin admin 158622 Oct 16 10:21 _3ff.cfs
-rwx------+ 1 admin admin 158622 Oct 16 10:21 _4aa.cfs
-rwx------+ 1 admin admin 158614 Oct 16 10:20 _uu.cfs
-rwx------+ 1 admin admin       4 Oct 16 10:21 deletable
-rwx------+ 1 admin admin     64 Oct 16 10:21 segments

At this point, I want to optimize() the index. I can do it safely
without interrupting the Tomcat process.
After optimizing the index, I get a new compound file, _4ab.cfs:

-rwx------+ 1 admin admin 158622 Oct 16 10:21 _1pp.cfs
-rwx------+ 1 admin admin 158622 Oct 16 10:21 _2kk.cfs
-rwx------+ 1 admin admin 158622 Oct 16 10:21 _3ff.cfs
-rwx------+ 1 admin admin 158622 Oct 16 10:21 _4aa.cfs
-rwx------+ 1 admin admin 791622 Oct 16 10:21 _4ab.cfs
-rwx------+ 1 admin admin 158614 Oct 16 10:20 _uu.cfs
-rwx------+ 1 admin admin     48 Oct 16 10:21 deletable
-rwx------+ 1 admin admin     29 Oct 16 10:21 segments

Now I can let the Tomcat threads know that we have a new compound
file so that the servlet can reopen the IndexSearcher to use the new
segments. But I want to delete the old, unnecessary files (_1pp, _2kk,
_3ff, _4aa and _uu .cfs files) after reopening the IndexSearcher
to save disk space.

How can I get a list of the unnecessary files so that I can delete them?

regards,

Koji




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: delete unnecessary files after optimize()

Koji Sekiguchi-4
Hi again,

I've read http://lucene.apache.org/java/docs/fileformats.html
and now I think I understand the format of the deletable file.

> How can I get a list of unnecessary files to delete them?

I can get this information from the deletable file in a Win32 environment,
correct?

Koji
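
For anyone who wants to look inside the deletable file, here is a minimal sketch of a reader. It assumes the layout described on the file formats page linked above for Lucene versions of this era: a big-endian 32-bit count, followed by that many strings, each stored as a VInt length and then its characters. The class and method names are made up for illustration, and the sketch assumes plain ASCII file names (one byte per character). That layout agrees with the directory listings in the first message: an empty list is 4 bytes, and five names (four of 8 characters, one of 7) come to 4 + 4*9 + 8 = 48 bytes. It is meant for inspection only.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class DeletableReader {

    /** Reads a VInt: 7 data bits per byte, low-order group first, high bit set on every byte but the last. */
    static int readVInt(DataInputStream in) throws IOException {
        int b = in.readUnsignedByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readUnsignedByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Returns the file names listed in a deletable file (assumed layout: count, then length-prefixed names). */
    static List<String> readDeletable(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        int count = in.readInt(); // big-endian 32-bit count
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int length = readVInt(in);
            StringBuilder name = new StringBuilder(length);
            for (int j = 0; j < length; j++) {
                // Assumption: ASCII-only names, so each character occupies one byte
                name.append((char) in.readUnsignedByte());
            }
            names.add(name.toString());
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Hand-built sample: one entry, "_uu.cfs" (7 characters)
        byte[] sample = {0, 0, 0, 1, 7, '_', 'u', 'u', '.', 'c', 'f', 's'};
        System.out.println(readDeletable(new ByteArrayInputStream(sample)));
    }
}
```

To read a real index, pass a FileInputStream opened on the deletable file in the index directory instead of the in-memory sample.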


RE: delete unnecessary files after optimize()

Chris Hostetter-3

: > How can I get a list of unnecessary files to delete them?
:
: I can get such information from deletable file under Win32 environment,
: correct?

I've never used Lucene on Windows, but if I recall correctly from past
discussions on this topic, the IndexWriter will try to delete any file
listed in deletable whenever it does any segment merging (i.e., after adding
some number of documents, when you call .optimize(), or when you call
.close()).

The only reason those files won't be deleted is if some IndexReader has
them open, in which case you won't be able to delete them either, so
don't worry about it. The safest thing to do is make sure you
periodically reopen new IndexReaders, and if you're really in a hurry to
get rid of those files, periodically open/close a new IndexWriter too
(even if you don't need one); that should cause it to try to delete the
files again.
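
The "periodically reopen" part can be sketched roughly as below. This is not Lucene code: FakeSearcher is a hypothetical stand-in for an IndexSearcher so that the example runs on its own, and ReopenSketch, current() and reopen() are names invented for illustration. The idea is only that the shared searcher gets swapped for a fresh one and the old one is closed, so it no longer holds the old segment files open.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicReference;

public class ReopenSketch {

    /** Hypothetical stand-in for an IndexSearcher: it "holds" segment files until closed. */
    static final class FakeSearcher implements Closeable {
        final String segments;
        volatile boolean closed;

        FakeSearcher(String segments) {
            this.segments = segments;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // The single searcher instance shared by all request threads
    private static final AtomicReference<FakeSearcher> CURRENT = new AtomicReference<>();

    static FakeSearcher current() {
        return CURRENT.get();
    }

    /** Publishes the fresh searcher, then closes the one it replaced and returns it. */
    static FakeSearcher reopen(FakeSearcher fresh) {
        FakeSearcher old = CURRENT.getAndSet(fresh);
        if (old != null) {
            // Assumption: no search is still running against 'old'.
            // A real servlet would have to wait for in-flight searches first.
            old.close();
        }
        return old;
    }

    public static void main(String[] args) {
        reopen(new FakeSearcher("_1pp _2kk _3ff _4aa _uu"));
        FakeSearcher old = reopen(new FakeSearcher("_4ab"));
        System.out.println("now searching: " + current().segments);
        System.out.println("old searcher closed: " + old.closed);
    }
}
```

Once the old searcher is closed, the next IndexWriter merge or close should be able to remove the files it was holding.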


-Hoss



RE: delete unnecessary files after optimize()

Koji Sekiguchi-4
> I've never used Lucene on windows, but if I recall correctly from past
> discussions on this topic, the IndexWriter will try to delete any file
> listed in deletable whenever it does any segment merging (ie: after adding
> some number of documents, when you call .optimize(), or when you call
> .close().

You are correct.
Calling addDocument() removes the unnecessary files and shrinks the
deletable file back to 4 bytes.

Thank you,

Koji


Lucene in Action : example code -> document-parsing framework ...

Patricio Galeas
Hi all,
I am trying to run an example from the "Lucene in Action" book:

Chapter 7: Parsing Common Document Formats:
lia.handlingtypes.framework.FileIndexer

I have downloaded all the source code from www.manning.com/hatcher2
and created a Java project in Lucene 3.1.

I get the following error message when the PDF document is indexed:
---------------------------------------
Indexing E:\Galeas\downloads\LuceneInAction\LuceneInAction\src\lia\handlingtypes\data\addressbook-entry.xml
log4j:WARN No appenders could be found for logger (org.apache.commons.digester.Digester.sax).
log4j:WARN Please initialize the log4j system properly.
Indexing E:\Galeas\downloads\LuceneInAction\LuceneInAction\src\lia\handlingtypes\data\addressbook.xml
Indexing E:\Galeas\downloads\LuceneInAction\LuceneInAction\src\lia\handlingtypes\data\HTML.html
Indexing E:\Galeas\downloads\LuceneInAction\LuceneInAction\src\lia\handlingtypes\data\MSWord.doc
Indexing E:\Galeas\downloads\LuceneInAction\LuceneInAction\src\lia\handlingtypes\data\PDF.pdf
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/log4j/Logger
    at org.pdfbox.pdfparser.BaseParser.<clinit>(BaseParser.java:70)
    at lia.handlingtypes.pdf.PDFBoxPDFHandler.parseDocument(PDFBoxPDFHandler.java:118)
    at lia.handlingtypes.pdf.PDFBoxPDFHandler.getDocument(PDFBoxPDFHandler.java:32)
    at lia.handlingtypes.framework.ExtensionFileHandler.getDocument(ExtensionFileHandler.java:39)
    at lia.handlingtypes.framework.FileIndexer.index(FileIndexer.java:43)
    at lia.handlingtypes.framework.FileIndexer.index(FileIndexer.java:36)
    at lia.handlingtypes.framework.FileIndexer.main(FileIndexer.java:77)
---------------------------------------

Does anybody have an idea?
Thank You
Patricio




Re: Lucene in Action : example code -> document-parsing framework ...

msftblows
Do you have the log4j.properties file in the classpath?
 

RE: Lucene in Action : example code -> document-parsing framework ...

n.bulthuis
In reply to this post by Patricio Galeas
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/log4j/Logger
  at org.pdfbox.pdfparser.BaseParser.<clinit>(BaseParser.java:70)

PDFBox cannot find Log4J. You can add Log4J to your classpath to fix
this.
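
A quick way to confirm whether a class is visible at runtime is a small check like the one below. It is only a sketch: ClasspathCheck and isOnClasspath are names made up for this example, and it has to be run with the same classpath as the failing program for the answer to mean anything.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be loaded, without running its static initializers. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // java.lang.String is always present; the second line depends on your classpath
        System.out.println("java.lang.String: " + isOnClasspath("java.lang.String"));
        System.out.println("org.apache.log4j.Logger: " + isOnClasspath("org.apache.log4j.Logger"));
    }
}
```

If the second line prints false, the Log4J jar is not on the classpath that the program is actually using.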



------------------------------------------------------------------------------------------------
Disclaimer:
' Rights can be derived from the content of this message with respect to Interpay Nederland B.V. or its affiliated companies only if they are supported by validly signed documents. The information in this e-mail message is confidential and intended solely for use by the addressee. If you have received a message unintentionally, you are requested to notify the sender and to destroy the message without taking note of its contents, reproducing it, or using it in any other way.'
An English version of this disclaimer is available on http://www.interpay.nl/eng/general/disclaimer.asp
------------------------------------------------------------------------------------------------



RE: Lucene in Action : example code -> document-parsing framework ...

Ben Litchfield-3

In addition, the latest version (0.7.2) of PDFBox does not require log4j,
so you could also upgrade to that version.

Ben



Re: Lucene in Action : example code -> document-parsing framework ...

Patricio Galeas
Hello,
first, thank you for your help ....!!

I have replaced the JAR file in the Eclipse "Java Build Path" with
the latest version (PDFBox-0.7.2.jar), but I still receive the same
error message:

Indexing E:\Galeas\lucene\data\pdfs\Beginning Java Server Pages.pdf
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/log4j/Logger
    at org.pdfbox.pdfparser.BaseParser.<clinit>(BaseParser.java:70)
    at org.galeas.index.handlingtypes.pdf.PDFBoxPDFHandler.parseDocument(PDFBoxPDFHandler.java:117)
    at org.galeas.index.handlingtypes.pdf.PDFBoxPDFHandler.getDocument(PDFBoxPDFHandler.java:32)
    at org.galeas.index.handlingtypes.framework.ExtensionFileHandler.getDocument(ExtensionFileHandler.java:38)
    at org.galeas.index.handlingtypes.framework.FileIndexer.index(FileIndexer.java:43)
    at org.galeas.index.handlingtypes.framework.FileIndexer.index(FileIndexer.java:36)
    at org.galeas.index.handlingtypes.framework.FileIndexer.main(FileIndexer.java:77)

What am I doing wrong?
Thank You
Patricio




Re: Lucene in Action : example code -> document-parsing framework ...

Erik Hatcher
Please try running the examples from the command line, per the
README. Eliminate Eclipse from the equation first.

     Erik
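
Both stack traces in this thread fail at the same frame (BaseParser.java:70), which may mean the old PDFBox jar is still the one being loaded. One way to check is to print where a class was loaded from. The sketch below uses only standard JDK calls; WhichJar and locationOf are names invented for this example.

```java
import java.security.CodeSource;

public class WhichJar {

    /** Returns the jar or directory a class was loaded from, or a placeholder if the JVM does not report one. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? "(bootstrap / unknown)" : String.valueOf(source.getLocation());
    }

    public static void main(String[] args) {
        // Prints the location of this class itself, then of a JDK class
        System.out.println(locationOf(WhichJar.class));
        System.out.println(locationOf(String.class));
    }
}
```

To apply it to PDFBox, one could pass the result of Class.forName("org.pdfbox.pdfparser.BaseParser", false, loader), which loads the class without running the static initializer that fails in the traces above. Running this once from the command line and once from Eclipse should show whether the two are picking up different jars.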



On 17 Oct 2005, at 19:12, Patricio Galeas wrote:

> Hello,
> first, thank you for your help ....!!
>
> I have replaced the JAR File in the "Java Build Path" von Eclipse  
> with the lastest version (PDFBox-0.7.2.jar), but I still receive  
> the same error message :
>
> Indexing E:\Galeas\lucene\data\pdfs\Beginning Java Server Pages.pdf
> Exception in thread "main" java.lang.NoClassDefFoundError: org/
> apache/log4j/Logger
>    at org.pdfbox.pdfparser.BaseParser.<clinit>(BaseParser.java:70)
>    at  
> org.galeas.index.handlingtypes.pdf.PDFBoxPDFHandler.parseDocument
> (PDFBoxPDFHandler.java:117)
>    at  
> org.galeas.index.handlingtypes.pdf.PDFBoxPDFHandler.getDocument
> (PDFBoxPDFHandler.java:32)
>    at  
> org.galeas.index.handlingtypes.framework.ExtensionFileHandler.getDocum
> ent(ExtensionFileHandler.java:38)
>    at org.galeas.index.handlingtypes.framework.FileIndexer.index
> (FileIndexer.java:43)
>    at org.galeas.index.handlingtypes.framework.FileIndexer.index
> (FileIndexer.java:36)
>    at org.galeas.index.handlingtypes.framework.FileIndexer.main
> (FileIndexer.java:77)
>
> What do I wrong?
> Thank You
> Patricio
>
>
>
> Ben Litchfield schrieb:
>
>
>> In addition, the latest version(0.7.2) of PDFBox does not require  
>> log4j,
>> so you could also upgrade to that version.
>>
>> Ben
>>
>>
>> On Mon, 17 Oct 2005 [hidden email] wrote:
>>
>>
>>
>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>> org/apache/log4j/Logger
>>>  at org.pdfbox.pdfparser.BaseParser.<clinit>(BaseParser.java:70)
>>>
>>> PDFBox cannot find Log4J. You can add Log4J to your classpath to fix
>>> this.
>>>
>>> -----Original Message-----
>>> From: [hidden email] [mailto:[hidden email]]
>>> Sent: 17 October 2005 16:09
>>> To: [hidden email]; [hidden email]
>>> Subject: Re: Lucene in Action : example code -> document-parsing
>>> framework ...
>>>
>>>
>>> Do you have the log4j.properties file in the classpath?
>>>
>>> -----Original Message-----
>>> From: Patricio Galeas <[hidden email]>
>>> To: [hidden email]
>>> Sent: Mon, 17 Oct 2005 15:50:46 +0200
>>> Subject: Lucene in Action : example code -> document-parsing  
>>> framework
>>> ...
>>>
>>>
>>> Hi ALL,
>>> I am trying to run an example from the "Lucene in Action" book:
>>>
>>> Chapter 7: Parsing Common Document Formats:
>>> lia.handlingtypes.framework.FileIndexer
>>>
>>> I have downloaded all the source code from www.manning.com/hatcher2
>>> and created a Java project in Eclipse 3.1.
>>>
>>> I get the following error message when the PDF document is
>>> indexed:
>>> ---------------------------------------
>>> Indexing E:\Galeas\downloads\LuceneInAction\LuceneInAction\src\lia\handlingtypes\data\addressbook-entry.xml
>>> log4j:WARN No appenders could be found for logger (org.apache.commons.digester.Digester.sax).
>>> log4j:WARN Please initialize the log4j system properly.
>>> Indexing E:\Galeas\downloads\LuceneInAction\LuceneInAction\src\lia\handlingtypes\data\addressbook.xml
>>> Indexing E:\Galeas\downloads\LuceneInAction\LuceneInAction\src\lia\handlingtypes\data\HTML.html
>>> Indexing E:\Galeas\downloads\LuceneInAction\LuceneInAction\src\lia\handlingtypes\data\MSWord.doc
>>> Indexing E:\Galeas\downloads\LuceneInAction\LuceneInAction\src\lia\handlingtypes\data\PDF.pdf
>>> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/log4j/Logger
>>>  at org.pdfbox.pdfparser.BaseParser.<clinit>(BaseParser.java:70)
>>>  at lia.handlingtypes.pdf.PDFBoxPDFHandler.parseDocument(PDFBoxPDFHandler.java:118)
>>>  at lia.handlingtypes.pdf.PDFBoxPDFHandler.getDocument(PDFBoxPDFHandler.java:32)
>>>  at lia.handlingtypes.framework.ExtensionFileHandler.getDocument(ExtensionFileHandler.java:39)
>>>  at lia.handlingtypes.framework.FileIndexer.index(FileIndexer.java:43)
>>>  at lia.handlingtypes.framework.FileIndexer.index(FileIndexer.java:36)
>>>  at lia.handlingtypes.framework.FileIndexer.main(FileIndexer.java:77)
>>> ---------------------------------------
>>>
>>> Does anybody have an idea?
>>> Thank You
>>> Patricio
>>>
>>>
>>>
>>
>>
>>
>>
>>
>>
> I have replace the
>
>



Re: Lucene in Action : example code -> document-parsing framework ...

Malcolm Clark

Hi,

Could somebody please help me with Lucene and Digester? I discovered this problem while indexing the INEX collection of XML for my MSc project.

While parsing the XML files, which are all named Volume.xml, the parser indexes only the last XML element in any repeated list. For example:

<book>

<chapters>

<title></title>

<title></title> //Only this title element is indexed

</chapters>

</book>

How does one put multiple values into one Digester field for Lucene indexing?

 

Thanks in advance.

MC
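
A common cause of "only the last element survives" is that each repeated element is written into a single-valued property, so every new `<title>` overwrites the previous one. The sketch below illustrates the alternative, accumulating every occurrence in a list. It is not the poster's code and it does not use Digester: it uses the JDK's own SAX parser so that it runs without extra JARs, and the class name `TitleCollector` and the sample title text are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of every <title> element
// instead of keeping only the most recent one.
class TitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private StringBuilder text; // non-null only while inside a <title>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(text.toString().trim()); // append, do not overwrite
            text = null;
        }
    }

    /** Parses the given XML string and returns the text of all <title> elements in order. */
    static List<String> parseTitles(String xml) throws Exception {
        TitleCollector handler = new TitleCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<book><chapters><title>First</title><title>Second</title></chapters></book>";
        System.out.println(parseTitles(xml)); // prints [First, Second]
    }
}
```

With Digester the equivalent idea would be to point the rule for the repeated element at a method that appends rather than a setter that replaces, for example `addCallMethod("book/chapters/title", "addTitle", 0)` against a bean with a hypothetical `addTitle(String)` method. On the Lucene side, a Document accepts several fields with the same name, so each collected title can be added as its own "title" field.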