Reviving a dead index


Reviving a dead index

Stanislav Jordanov
What might be the possible reason for an IndexReader failing to open
properly because it cannot find a .fnm file that is expected to be there:

java.io.FileNotFoundException: E:\index4\_1j8s.fnm (The system cannot
find the file specified)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
    at
org.apache.lucene.store.FSIndexInput$Descriptor.<init>(FSDirectory.java:425)
    at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:434)
    at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:324)
    at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:56)
    at
org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:144)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:129)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:110)
    at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:154)
    at org.apache.lucene.store.Lock$With.run(Lock.java:109)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:143)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:127)
    at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:42)



The only thing that comes to mind is that, last time, the indexing
process was not shut down properly.
Is there a way to revive the index, or should everything be reindexed
from scratch?


Thanks,
Stanislav



Re: Reviving a dead index

Michael McCandless-2
Stanislav Jordanov wrote:
> What might be the possible reason for an IndexReader failing to open
> properly,
> because it can not find a .fnm file that is expected to be there:

This means the segments file is referencing a segment named _1j8s, and
in trying to load that segment, the first thing Lucene does is load the
"field infos" (_1j8s.fnm).  It tries to do so from a compound file (if
you have it turned on & it exists), else from the filesystem directly.

Which version of Lucene are you using?  And which OS are you running on?

Is this error easily repeated (not a transient error)?  I.e., does
instantiating an IndexSearcher against your index always cause this
exception?  This sort of exception is certainly possible when
Lucene's locking is not working correctly (for example over NFS), but in
that case it's typically very intermittent.

Could you send a list of the files in your index?
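
For example, a quick sketch that dumps that list (just an illustration --
point it at wherever your index actually lives):

    import java.io.File;

    public class ListIndexFiles {
        public static void main(String[] args) {
            File indexDir = new File("E:/index4");  // your index directory
            String[] names = indexDir.list();
            java.util.Arrays.sort(names);
            for (int i = 0; i < names.length; i++) {
                File f = new File(indexDir, names[i]);
                System.out.println(names[i] + "  " + f.length() + " bytes");
            }
        }
    }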

> The only thing that comes to my mind is that last time the indexing
> process was not shut down properly.
> Is there a way to revive the index or everything should be reindexed
> from scratch?

Hmmm.  It's surprising that an improper shutdown caused this, because
when the IndexWriter commits its changes, it first writes all files for
the new segment, and only when that's successful does it write a new
segments file referencing the newly written segment.  Could you provide
some more detail about your setup and how the improper shutdown happened?

Mike


Installing a custom tokenizer

Bill Taylor-2
In reply to this post by Stanislav Jordanov
I am indexing documents which are filled with government jargon.  As
one would expect, the standard tokenizer has problems with
governmenteese.

In particular, the documents use words such as 310N-P-Q as references
to other documents.  The standard tokenizer breaks this "word" at the
dashes so that I can find P or Q but not the entire token.

I know how to write a new tokenizer.  I would like hints on how to
install it and get my indexing system to use it.  I don't want to
modify the standard .jar file.  What I think I want to do is set up my
indexing operation to use the WhitespaceTokenizer instead of the normal
one, but I am unsure how to do this.

I know that the IndexTask has a setAnalyzer method.  The document
formats are rather complicated and I need special code to isolate the
text strings which should be indexed.   My file analyzer isolates the
string I want to index, then does

doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
Field.Store.YES, Field.Index.TOKENIZED));

I suspect that my issue is getting the Field constructor to use a
different tokenizer.  Can anyone help?

Thanks.

Bill Taylor



RE: Installing a custom tokenizer

Krovi, DVSR_Sarma
> I suspect that my issue is getting the Field constructor to use a
> different tokenizer.  Can anyone help?

You basically need to come up with your own Tokenizer (you can always
write a corresponding JavaCC grammar; compiling it will give you the
Tokenizer).
Then you need to extend the org.apache.lucene.analysis.Analyzer class and
override the tokenStream() method. Now, wherever you are
indexing/searching, use an instance of this custom analyzer.
public class MyAnalyzer extends Analyzer
{
        public TokenStream tokenStream(String fieldName, java.io.Reader reader)
        {
                TokenStream ts = new MyTokenizer(reader);
                /* Pass this token stream through whatever other filters
                   you are interested in, then return the result */
                return ts;
        }
}
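
For instance, on the indexing side the analyzer gets handed to the
IndexWriter (just a sketch -- the index path here is made up):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.IndexWriter;

    public class IndexWithMyAnalyzer {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new MyAnalyzer();
            // The writer runs every tokenized field through this analyzer.
            IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
            // ... add your documents here ...
            writer.close();
        }
    }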

Krovi.

-----Original Message-----
From: Bill Taylor [mailto:[hidden email]]
Sent: Tuesday, August 29, 2006 8:10 PM
To: [hidden email]
Subject: Installing a custom tokenizer

I am indexing documents which are filled with government jargon.  As
one would expect, the standard tokenizer has problems with
governmenteese.

In particular, the documents use words such as 310N-P-Q as references
to other documents.  The standard tokenizer breaks this "word" at the
dashes so that I can find P or Q but not the entire token.

I know how to write a new tokenizer.  I would like hints on how to
install it and get my indexing system to use it.  I don't want to
modify the standard .jar file.  What I think I want to do is set up my
indexing operation to use the WhitespaceTokenizer instead of the normal
one, but I am unsure how to do this.

I know that the IndexTask has a setAnalyzer method.  The document
formats are rather complicated and I need special code to isolate the
text strings which should be indexed.   My file analyzer isolates the
string I want to index, then does

doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
Field.Store.YES, Field.index.TOKENIZED));

I suspect that my issue is getting the Field constructor to use a
different tokenizer.  Can anyone help?

Thanks.

Bill Taylor



Re: Installing a custom tokenizer

Bill Taylor-2
Is there some way to get the standard Field constructor to use, say,
the WhitespaceTokenizer as opposed to the standard tokenizer?

On Aug 29, 2006, at 10:50 AM, Krovi, DVSR_Sarma wrote:

>> I suspect that my issue is getting the Field constructor to use a
>> different tokenizer.  Can anyone help?
>
> You need to basically come up with your own Tokenizer (You can always
> write a corresponding JavaCC grammar and compiling it would give the
> Tokenizer)
> Then you need to extend org.apache.lucene.analysis.Analyzer class and
> override the tokenStream() method. Now, wherever you are
> indexing/searching, use the object of this CustomAnalyzer.
> Public class MyAnalyzer extended Analyzer
> {
> public TokenStream tokenStream(....)
> {
> TokenStream ts = null;
> ts = new MyTokenizer(reader);
> /* Pass this tokenstream through other filters you are
> interested in */
> }
> }
>
> Krovi.
>
> -----Original Message-----
> From: Bill Taylor [mailto:[hidden email]]
> Sent: Tuesday, August 29, 2006 8:10 PM
> To: [hidden email]
> Subject: Installing a custom tokenizer
>
> I am indexing documents which are filled with government jargon.  As
> one would expect, the standard tokenizer has problems with
> governmenteese.
>
> In particular, the documents use words such as 310N-P-Q as references
> to other documents.  The standard tokenizer breaks this "word" at the
> dashes so that I can find P or Q but not the entire token.
>
> I know how to write a new tokenizer.  I would like hints on how to
> install it and get my indexing system to use it.  I don't want to
> modify the standard .jar file.  What I think I want to do is set up my
> indexing operation to use the WhitespaceTokenizer instead of the normal
> one, but I am unsure how to do this.
>
> I know that the IndexTask has a setAnalyzer method.  The document
> formats are rather complicated and I need special code to isolate the
> text strings which should be indexed.   My file analyzer isolates the
> string I want to index, then does
>
> doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
> Field.Store.YES, Field.index.TOKENIZED));
>
> I suspect that my issue is getting the Field constructor to use a
> different tokenizer.  Can anyone help?
>
> Thanks.
>
> Bill Taylor
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>



Re: Installing a custom tokenizer

Ronnie Kolehmainen
Have a look at PerFieldAnalyzerWrapper:


http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
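
For example (a sketch -- the "contents" field name and index path are
just placeholders):

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class PerFieldExample {
        public static void main(String[] args) throws Exception {
            // StandardAnalyzer for most fields, whitespace-only
            // tokenization for the jargon-heavy content field.
            PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            wrapper.addAnalyzer("contents", new WhitespaceAnalyzer());

            IndexWriter writer = new IndexWriter("/path/to/index", wrapper, true);
            // ... add documents; hand the same wrapper to QueryParser at search time ...
            writer.close();
        }
    }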


Quoting Bill Taylor <[hidden email]>:

> is there some way to get the standard Field constructor to use, say,
> the Whitespace Tokenizer as opposed to the standard tokenizer?
>
> On Aug 29, 2006, at 10:50 AM, Krovi, DVSR_Sarma wrote:
>
> >> I suspect that my issue is getting the Field constructor to use a
> >> different tokenizer.  Can anyone help?
> >
> > You need to basically come up with your own Tokenizer (You can always
> > write a corresponding JavaCC grammar and compiling it would give the
> > Tokenizer)
> > Then you need to extend org.apache.lucene.analysis.Analyzer class and
> > override the tokenStream() method. Now, wherever you are
> > indexing/searching, use the object of this CustomAnalyzer.
> > Public class MyAnalyzer extended Analyzer
> > {
> > public TokenStream tokenStream(....)
> > {
> > TokenStream ts = null;
> > ts = new MyTokenizer(reader);
> > /* Pass this tokenstream through other filters you are
> > interested in */
> > }
> > }
> >
> > Krovi.
> >
> > -----Original Message-----
> > From: Bill Taylor [mailto:[hidden email]]
> > Sent: Tuesday, August 29, 2006 8:10 PM
> > To: [hidden email]
> > Subject: Installing a custom tokenizer
> >
> > I am indexing documents which are filled with government jargon.  As
> > one would expect, the standard tokenizer has problems with
> > governmenteese.
> >
> > In particular, the documents use words such as 310N-P-Q as references
> > to other documents.  The standard tokenizer breaks this "word" at the
> > dashes so that I can find P or Q but not the entire token.
> >
> > I know how to write a new tokenizer.  I would like hints on how to
> > install it and get my indexing system to use it.  I don't want to
> > modify the standard .jar file.  What I think I want to do is set up my
> > indexing operation to use the WhitespaceTokenizer instead of the normal
> > one, but I am unsure how to do this.
> >
> > I know that the IndexTask has a setAnalyzer method.  The document
> > formats are rather complicated and I need special code to isolate the
> > text strings which should be indexed.   My file analyzer isolates the
> > string I want to index, then does
> >
> > doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
> > Field.Store.YES, Field.index.TOKENIZED));
> >
> > I suspect that my issue is getting the Field constructor to use a
> > different tokenizer.  Can anyone help?
> >
> > Thanks.
> >
> > Bill Taylor
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>





Re: Installing a custom tokenizer

Erick Erickson
In reply to this post by Bill Taylor-2
I'm in a real rush here, so pardon my brevity, but..... one of the
constructors for IndexWriter takes an Analyzer as a parameter, which can be
a PerFieldAnalyzerWrapper. That, if I understand your issue, should fix you
right up.

Same kind of thing for a Query.

Erick

On 8/29/06, Bill Taylor <[hidden email]> wrote:

>
> I am indexing documents which are filled with government jargon.  As
> one would expect, the standard tokenizer has problems with
> governmenteese.
>
> In particular, the documents use words such as 310N-P-Q as references
> to other documents.  The standard tokenizer breaks this "word" at the
> dashes so that I can find P or Q but not the entire token.
>
> I know how to write a new tokenizer.  I would like hints on how to
> install it and get my indexing system to use it.  I don't want to
> modify the standard .jar file.  What I think I want to do is set up my
> indexing operation to use the WhitespaceTokenizer instead of the normal
> one, but I am unsure how to do this.
>
> I know that the IndexTask has a setAnalyzer method.  The document
> formats are rather complicated and I need special code to isolate the
> text strings which should be indexed.   My file analyzer isolates the
> string I want to index, then does
>
> doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
> Field.Store.YES, Field.index.TOKENIZED));
>
> I suspect that my issue is getting the Field constructor to use a
> different tokenizer.  Can anyone help?
>
> Thanks.
>
> Bill Taylor
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

Re: Installing a custom tokenizer

Bill Taylor-2

On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:

> I'm in a real rush here, so pardon my brevity, but..... one of the
> constructors for IndexWriter takes an Analyzer as a parameter, which
> can be
> a PerFieldAnalyzerWrapper. That, if I understand your issue, should
> fix you
> right up.

That almost worked.  I can't use a per-field analyzer because I have to
process the content fields of all documents.  I built a custom analyzer
which extended StandardAnalyzer and replaced the tokenStream
method with a new one which used WhitespaceTokenizer instead of
StandardTokenizer.  This meant that my document IDs were not split, but
I lost the conversion of acronyms such as w.o. to wo, and the like.
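
Roughly, what I built looks like this (a sketch -- the class name and the
lower-casing step are just illustrative):

    import java.io.Reader;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class WhitespaceContentAnalyzer extends StandardAnalyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Whitespace tokenization keeps document IDs like 310N-P-Q intact,
            // but drops StandardTokenizer's special handling of acronyms etc.
            TokenStream ts = new WhitespaceTokenizer(reader);
            return new LowerCaseFilter(ts);
        }
    }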

So what I need to do is to make a new Tokenizer based on the
StandardTokenizer except that a NUM on line 83 of StandardTokenizer.jj
should be

| <NUM: (<ALPHANUM> (<P> <ALPHANUM>)+ | <ALPHANUM>) >

so that a serial number need not have a digit in every other segment
and a series of letters and digits without special characters such as a
dash will be treated as a single word.

Questions:

1) If I change the .jj file in this way, how do I run JavaCC to make a
new tokenizer?  The JavaCC documentation says that JavaCC generates a
number of output files; I think that I only need the tokenizer code.

2) I suppose I have to tell the query parser to parse queries in the
same way, is that right?

The reason I think so is that Luke says I have words such as w.o. in
the index which the query parser can't find.  I suspect I have to use
the same Analyzer on both, right?

> On 8/29/06, Bill Taylor <[hidden email]> wrote:
>>
>> I am indexing documents which are filled with government jargon.  As
>> one would expect, the standard tokenizer has problems with
>> governmenteese.
>>
>> In particular, the documents use words such as 310N-P-Q as references
>> to other documents.  The standard tokenizer breaks this "word" at the
>> dashes so that I can find P or Q but not the entire token.
>>
>> I know how to write a new tokenizer.  I would like hints on how to
>> install it and get my indexing system to use it.  I don't want to
>> modify the standard .jar file.  What I think I want to do is set up my
>> indexing operation to use the WhitespaceTokenizer instead of the
>> normal
>> one, but I am unsure how to do this.
>>
>> I know that the IndexTask has a setAnalyzer method.  The document
>> formats are rather complicated and I need special code to isolate the
>> text strings which should be indexed.   My file analyzer isolates the
>> string I want to index, then does
>>
>> doc.add(new Field(DocFormatters.CONTENT_FIELD, <string from the file>,
>> Field.Store.YES, Field.index.TOKENIZED));
>>
>> I suspect that my issue is getting the Field constructor to use a
>> different tokenizer.  Can anyone help?
>>
>> Thanks.
>>
>> Bill Taylor
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>



Re: Installing a custom tokenizer

Chris Hostetter-3
In reply to this post by Ronnie Kolehmainen

: Have a look at PerFieldAnalyzerWrapper:

: http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html

...which can be specified in the constructors for IndexWriter and
QueryParser.



-Hoss



Straight TF-IDF cosine similarity?

Winton Davies-2
In reply to this post by Bill Taylor-2
Hi All,

I'm scratching my head - can someone tell me which class implements
an efficient multiple term TF.IDF Cosine similarity scoring mechanism?

There is clearly the single TermScorer - but I can't find the class
that would do a bucketed TF.IDF cosine - i.e. fill an accumulator
with the tf.idf^2 for each of the term posting lists, until the
accumulator is full, and then compute the final score.

I don't need a Boolean Query - at least this seems like overkill.

Cheers,
  Winton


Re: Installing a custom tokenizer

Bill Taylor-2
In reply to this post by Chris Hostetter-3

On Aug 29, 2006, at 2:47 PM, Chris Hostetter wrote:

>
> : Have a look at PerFieldAnalyzerWrapper:
>
> :  
> http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/ 
> PerFieldAnalyzerWrapper.html
>
> ...which can be specified in the constructors for IndexWriter and
> QueryParser.

As I understand it, this allows me to specify a different analyzer for  
each field name.  My problem is that the standard analyzer will not  
work for my content field and I need to define a new one.  I need to  
make a modification to the StandardTokenizer so that a number does not  
need to have a digit in every other segment of a part number.

For example, the StandardTokenizer breaks aa-bb-2 on the - between aa  
and bb because it demands that every other string between a - have a  
digit.

I need to modify the .jj file for the StandardTokenizer and get a new
one, but I am confused by the JavaCC documentation and do not know how
to run it to get what I need.

Thanks for the help.



Re: Installing a custom tokenizer

Mark Miller-3
In reply to this post by Bill Taylor-2
Bill Taylor wrote:

>
> On Aug 29, 2006, at 1:46 PM, Erick Erickson wrote:
>
>> I'm in a real rush here, so pardon my brevity, but..... one of the
>> constructors for IndexWriter takes an Analyzer as a parameter, which
>> can be
>> a PerFieldAnalyzerWrapper. That, if I understand your issue, should
>> fix you
>> right up.
>
> that almost worked.  I can't use a per Field analyzer because I have
> to process the content fields of all documents.  I built a custom
> analyzer which extended the Standard Analyzer and replaced the
> tokenStream method with a new one which used WhitespaceTokenizer
> instead of StandardTokenizer.  This meant that my document IDs were
> not split, but I lost the conversion of acronyms such as w.o. to wo
> and the like
>
> So what I need to do is to make a new Tokenizer based on the
> StandardTokenizer except that a NUM on line 83 of StandardTokenizer.jj
> should be
>
> | NUM: (<ALPHANUM> (<P> <ALPHANUM>) +  | <ALPHANUM>) >
>
> so that a serial number need not have a digit in every other segment
> and a series of letters and digits without special characters such as
> a dash will be treated as a single word.
>
> Questions:
>
> 1) If I change the .jj file in this way, how to I run javaCC to make a
> new tokenizer?  The JavaCC documentation says that JavaCC generates a
> number of output files; I think that I only need the tokenizer code.
>
> 2) I suppose i have to tell the query parser to parse queries in the
> same way, is that right?
>
> The reason I think so is that Luke says I have words such as w.o. in
> the index which the query parser can't find.  I suspect I have to use
> the same Analyzer on both, right?
>
Get JavaCC and run it on StandardTokenizer.jj. This should be as simple
as typing 'JavaCC StandardTokenizer.jj'...I believe with no output
folder specified, all of the files will be built in the current
directory. Don't worry about the extra files you do not need--JavaCC
will handle everything for you. If you use Eclipse I
recommend the JavaCC plug-in. I find it very handy.

Generally you must run the same analyzer that you indexed with on your
search strings...if the standard analyzer parses oldman-83 to oldman
while indexing and you use whitespace analyzer while searching then you
will attempt to find oldman-83 in the index instead of oldman (which was
what standard analyzer stored).
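
For instance (a sketch -- the field name, analyzer and index path are
placeholders; the key point is that it is the same analyzer you indexed with):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class SearchWithSameAnalyzer {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new WhitespaceAnalyzer();  // same one used at index time
            Query query = new QueryParser("contents", analyzer).parse("oldman-83");
            IndexSearcher searcher = new IndexSearcher("/path/to/index");
            Hits hits = searcher.search(query);
            System.out.println(hits.length() + " hits");
            searcher.close();
        }
    }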

- Mark


Re: Installing a custom tokenizer

Erick Erickson
In reply to this post by Bill Taylor-2
Tucked away in the contrib section of Lucene (I'm using 2.0) there is....

org.apache.lucene.index.memory.PatternAnalyzer

which takes a regular expression and tokenizes with it. Would that help?
Word of warning... the regex determines what is NOT a token, not what IS a
token (as I remember), which threw me for a bit.

Don't know if this is really useful, but it might work for you without as
much work...

Best
Erick@I'mNowBeyondMyCompetence.WhyDoTheyStillEmployMeHere?

On 8/29/06, Bill Taylor <[hidden email]> wrote:

>
>
> On Aug 29, 2006, at 2:47 PM, Chris Hostetter wrote:
>
> >
> > : Have a look at PerFieldAnalyzerWrapper:
> >
> > :
> > http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/
> > PerFieldAnalyzerWrapper.html
> >
> > ...which can be specified in the constructors for IndexWriter and
> > QueryParser.
>
> As I understand it, this allows me to specify a different analyzer for
> each field name.  My problem is that the standard analyzer will not
> work for my content field and I need to define a new one.  I need to
> make a modification to the StandardTokenizer so that a number does not
> need to have a digit in every other segment of a part number.
>
> For example, the StandardTokenizer breaks aa-bb-2 on the - between aa
> and bb because it demands that every other string between a - have a
> digit.
>
> I need to modify the .jj file for the Standard Tokenizer and get a new
> one, but I am confused by the javaCC documentation and do not know how
> to run it to get what I need.
>
> Thanks for the help.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

Creating a new index from an existing index

Xiaocheng Luan
Hi,
Got a question. Here is what I want to achieve:

Create a new index from an existing index, to change the boosting factor for some of the documents (and potentially some other tweaks), without reindexing it from the source.

Are there any tools or ways to do this?
Thanks!
Xiaocheng Luan

 

Re: Installing a custom tokenizer

Bill Taylor-2
In reply to this post by Erick Erickson
I have copied Lucene's StandardTokenizer.jj into my directory, renamed
it, and did a global change of the names to my class name,
LogTokenizer.

The issue is that the generated LogTokenizer.java does not compile for
2 reasons:

1) In the constructor, this(new FastCharStream(reader)); fails because
there is no such constructor in the parent class.  I commented it out.

2) I get an error on the next() method, which throws ParseException and
IOException.  The message is "Exception ParseException is not
compatible with throws clause in TokenStream.next()".  As far as I can
see, the exceptions are OK.

Since all of this is generated code, my feelings are a bit hurt.  Did
Lucene use an older version of JavaCC?  I am using javacc-4.0

On Aug 29, 2006, at 4:57 PM, Erick Erickson wrote:

> Tucked away in the contrib section of  Lucene (I'm using 2.0) there
> is....
>
> org.apache.lucene.index.memory.PatternAnalyzer
>
> which takes a regular expression as and tokenizes with it. Would that
> help?
> Word of warning... the regex determines what is NOT a token, not what
> IS a
> token (as I remember), which threw me for a bit.
>
> Don't know if this is really useful, but it might work for you without
> as
> much work...
>
> Best
> Erick@I'mNowBeyondMyCompetence.WhyDoTheyStillEmployMeHere?
>
> On 8/29/06, Bill Taylor <[hidden email]> wrote:
>>
>>
>> On Aug 29, 2006, at 2:47 PM, Chris Hostetter wrote:
>>
>> >
>> > : Have a look at PerFieldAnalyzerWrapper:
>> >
>> > :
>> > http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/
>> > PerFieldAnalyzerWrapper.html
>> >
>> > ...which can be specified in the constructors for IndexWriter and
>> > QueryParser.
>>
>> As I understand it, this allows me to specify a different analyzer for
>> each field name.  My problem is that the standard analyzer will not
>> work for my content field and I need to define a new one.  I need to
>> make a modification to the StandardTokenizer so that a number does not
>> need to have a digit in every other segment of a part number.
>>
>> For example, the StandardTokenizer breaks aa-bb-2 on the - between aa
>> and bb because it demands that every other string between a - have a
>> digit.
>>
>> I need to modify the .jj file for the Standard Tokenizer and get a new
>> one, but I am confused by the javaCC documentation and do not know how
>> to run it to get what I need.
>>
>> Thanks for the help.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>



Re: Installing a custom tokenizer

Mark Miller-3
Bill Taylor wrote:

> I have copied Lucene's StandardTokenizer.jj into my directory, renamed
> it, and did a global change of the names to my class name, LogTokenizer.
>
> The issue is that the generated LogTokenizer.java does not compile for
> 2 reasons:
>
> 1) in the constructor, this(new FastCharStream(reader)); fails because
> there is no such constructor in the parent class.  I commented it out.
>
> 2) I get an error on the next() method which throws ParseException and
> IO Exception.  The message is Exception ParseException is not
> compatible with throws clause in TokenStream.next().  As far as I can
> see, the exceptions are OK.
>
> Since all of this is generated code, my feelings are a bit hurt.  Did
> Lucene use an older version of JavaCC?  I am using javacc-4.0
>
> On Aug 29, 2006, at 4:57 PM, Erick Erickson wrote:
>
>
>
1. I would need to know more, but it sounds like you should uncomment it :)

2. JavaCC 4.0 is fine.  If you look at the Lucene build.xml you will see
that it compiles the JavaCC-generated classes in a separate directory so
that it can exclude the ParseException class from compilation. Check out
the build file and copy that method.

- Mark


Re: Installing a custom tokenizer

Mark Miller-3
In reply to this post by Bill Taylor-2
Bill Taylor wrote:

> I have copied Lucene's StandardTokenizer.jj into my directory, renamed
> it, and did a global change of the names to my class name, LogTokenizer.
>
> The issue is that the generated LogTokenizer.java does not compile for
> 2 reasons:
>
> 1) in the constructor, this(new FastCharStream(reader)); fails because
> there is no such constructor in the parent class.  I commented it out.
>
> 2) I get an error on the next() method which throws ParseException and
> IO Exception.  The message is Exception ParseException is not
> compatible with throws clause in TokenStream.next().  As far as I can
> see, the exceptions are OK.
>
> Since all of this is generated code, my feelings are a bit hurt.  Did
> Lucene use an older version of JavaCC?  I am using javacc-4.0
>
> On Aug 29, 2006, at 4:57 PM, Erick Erickson wrote:
>
>
Ok. How about some better answers:

1. this(new FastCharStream(reader)) does not refer to the parent class
but to the LogTokenizer class itself. There should be a constructor on the
LogTokenizer that takes a CharStream:

  public StandardTokenizer(CharStream stream) {
    token_source = new StandardTokenizerTokenManager(stream);
    token = new Token();
    jj_ntk = -1;
    jj_gen = 0;
    for (int i = 0; i < 1; i++) jj_la1[i] = -1;
  }

If your LogTokenizer does not have this, then something did not go right.

2. The ParseException that is generated when making the StandardAnalyzer
must be killed because there is another ParseException class (maybe in
the query parser?) that must be used instead. The Lucene build file excludes
the StandardAnalyzer ParseException so that the other one is used. You
could probably just delete it as well, but then of course you would have to
remember to delete it every time you rebuilt the JavaCC file.

- Mark


Re: Creating a new index from an existing index

Erick Erickson
In reply to this post by Xiaocheng Luan
A couple of things..

1> I don't think you set the boost when indexing. You set the boost when
querying (see the sketch below), so you don't need to re-index for boosting.

2> A recurring theme is that you can't do an update-in-place for a Lucene
document. You might search the mail archive for a discussion of this. The
short form is that if you want to change every document, you're probably
better off re-indexing the whole thing. If, for some reason you can't/don't
want to just re-index it all, then be aware that if you didn't store the
fields for the documents (i.e. use Field.Store.YES), then you really can't
reconstruct the document from the index without potentially losing
information.
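
For example, a query-time boost looks roughly like this (a sketch -- the
field and terms are made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class QueryBoostExample {
        public static Query buildQuery() {
            TermQuery boosted = new TermQuery(new Term("contents", "lucene"));
            boosted.setBoost(2.0f);  // matches on this clause count twice as much
            TermQuery normal = new TermQuery(new Term("contents", "index"));

            BooleanQuery query = new BooleanQuery();
            query.add(boosted, BooleanClause.Occur.SHOULD);
            query.add(normal, BooleanClause.Occur.SHOULD);
            return query;
        }
    }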

Hope this helps
Erick

On 8/29/06, Xiaocheng Luan <[hidden email]> wrote:

>
> Hi,
> Got a question. Here is what I want to achieve:
>
> Create a new index from an existing index, to change the boosting factor
> for some of the documents (and potentially some other tweaks), without
> reindexing it from the source.
>
> Is there any tools or ways to do this?
> Thanks!
> Xiaocheng Luan
>
>
> ---------------------------------
> Get your own web address for just $1.99/1st yr. We'll help. Yahoo! Small
> Business.
>

Re: Straight TF-IDF cosine similarity?

SnowCrash
In reply to this post by Bill Taylor-2
Have you looked at the MoreLikeThis class in the similarity package?

On 8/30/06, Winton Davies <[hidden email]> wrote:

>
> Hi All,
>
> I'm scratching my head - can someone tell me which class implements
> an efficient multiple term TF.IDF Cosine similarity scoring mechanism?
>
> There is clearly the single TermScorer - but I can't find the class
> that would do a bucketed TF.IDF cosine - i.e. fill an accumulator
> with the tf.idf^2 for each of the term posting lists, until
> accumulator is full, and then compute the final score.
>
> I don't need a Boolean Query - at least this seems like overkill.
>
> Cheers,
>   Winton
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

Re: Installing a custom tokenizer

Bill Taylor-2
In reply to this post by Mark Miller-3

On Aug 29, 2006, at 7:12 PM, Mark Miller wrote:

>
> 2. The ParseException that is generated when making the
> StandardAnalyzer must be killed because there is another
> ParseException class (maybe in queryparser?) that must be used
> instead. The lucene build file excludes the StandardAnalyzer
> ParseException so that the other one is used. You could prob just
> delete it as well but then of course you would have to remember to
> delete it every time you rebuilt the javacc file.
>

If I use the generated parse exception, I get the message that the
ParseException which is thrown by next() is incompatible with the
throws statement in the generated code.  If I get rid of it and use the
older one, it turns out that generateParseException() cannot generate a
ParseException because it passes a generated Token instead of the
Lucene Token.

It appears that the JavaCC I am using is generating code which is a bit
ahead of Lucene.


So I simply generated a new token of the proper Lucene type for the
error message and everything compiled.  Now I have to tell the query
parser to use the same analyzer and all will be well.

Thanks.

The good news is that Lucene is unbelievably wonderful.  The bad news
is that Lucene is unbelievably wonderful.

Thanks again.


