Problem with CharStream and Tokenizers with custom reset(Reader) method

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Problem with CharStream and Tokenizers with custom reset(Reader) method

Uwe Schindler
When reviewing the new CharStream code added to Tokenizers, I found a
serious problem with backwards compatibility and other Tokenizers, that do
not override reset(CharStream).

The problem is, that e.g. CharTokenizer only overrides reset(Reader):

  public void reset(Reader input) throws IOException {
    super.reset(input);
    bufferIndex = 0;
    offset = 0;
    dataLen = 0;
  }

If you reset such a Tokenizer with another CharStream (not a Reader), this
method will never be called and breaking the whole Tokenizer.

As CharStream extends Reader, I propose to remove this reset(CharStream
method) and simply do an instanceof check to detect if the supplied Reader
is no CharStream and wrap it. We could also remove the extra ctor (because
most Tokenizers have no support for passing CharStreams). If the ctor also
checks with instanceof and warps as needed the code is backwards compatible
and we do not need to add additional ctors in subclasses.

As this instanceof check is always done in CharReader.get() why not remove
ctor(CharStream) and reset(CharStream) completely?

Any thoughts?

I would like to fix this somehow before RC4, I', sorry :(

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Problem with CharStream and Tokenizers with custom reset(Reader) method

Uwe Schindler
I tested the attached patch, all tests still compile and work as exspected
(as CharStream extends Reader).

I think I should open an issue?

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]

> -----Original Message-----
> From: Uwe Schindler [mailto:[hidden email]]
> Sent: Thursday, September 10, 2009 4:54 PM
> To: [hidden email]
> Subject: Problem with CharStream and Tokenizers with custom reset(Reader)
> method
>
> When reviewing the new CharStream code added to Tokenizers, I found a
> serious problem with backwards compatibility and other Tokenizers, that do
> not override reset(CharStream).
>
> The problem is, that e.g. CharTokenizer only overrides reset(Reader):
>
>   public void reset(Reader input) throws IOException {
>     super.reset(input);
>     bufferIndex = 0;
>     offset = 0;
>     dataLen = 0;
>   }
>
> If you reset such a Tokenizer with another CharStream (not a Reader), this
> method will never be called and breaking the whole Tokenizer.
>
> As CharStream extends Reader, I propose to remove this reset(CharStream
> method) and simply do an instanceof check to detect if the supplied Reader
> is no CharStream and wrap it. We could also remove the extra ctor (because
> most Tokenizers have no support for passing CharStreams). If the ctor also
> checks with instanceof and warps as needed the code is backwards
> compatible
> and we do not need to add additional ctors in subclasses.
>
> As this instanceof check is always done in CharReader.get() why not remove
> ctor(CharStream) and reset(CharStream) completely?
>
> Any thoughts?
>
> I would like to fix this somehow before RC4, I', sorry :(
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [hidden email]
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Tokenizer-CharStream-fix.patch (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Problem with CharStream and Tokenizers with custom reset(Reader) method

Mark Miller-3
Yeah, lets open an issue and mark it blocker - I'll hold RC4 for it (was
just about to push it when I caught this email).

Uwe Schindler wrote:

> I tested the attached patch, all tests still compile and work as exspected
> (as CharStream extends Reader).
>
> I think I should open an issue?
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [hidden email]
>
>  
>> -----Original Message-----
>> From: Uwe Schindler [mailto:[hidden email]]
>> Sent: Thursday, September 10, 2009 4:54 PM
>> To: [hidden email]
>> Subject: Problem with CharStream and Tokenizers with custom reset(Reader)
>> method
>>
>> When reviewing the new CharStream code added to Tokenizers, I found a
>> serious problem with backwards compatibility and other Tokenizers, that do
>> not override reset(CharStream).
>>
>> The problem is, that e.g. CharTokenizer only overrides reset(Reader):
>>
>>   public void reset(Reader input) throws IOException {
>>     super.reset(input);
>>     bufferIndex = 0;
>>     offset = 0;
>>     dataLen = 0;
>>   }
>>
>> If you reset such a Tokenizer with another CharStream (not a Reader), this
>> method will never be called and breaking the whole Tokenizer.
>>
>> As CharStream extends Reader, I propose to remove this reset(CharStream
>> method) and simply do an instanceof check to detect if the supplied Reader
>> is no CharStream and wrap it. We could also remove the extra ctor (because
>> most Tokenizers have no support for passing CharStreams). If the ctor also
>> checks with instanceof and warps as needed the code is backwards
>> compatible
>> and we do not need to add additional ctors in subclasses.
>>
>> As this instanceof check is always done in CharReader.get() why not remove
>> ctor(CharStream) and reset(CharStream) completely?
>>
>> Any thoughts?
>>
>> I would like to fix this somehow before RC4, I', sorry :(
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: [hidden email]
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>    
>
>  
> ------------------------------------------------------------------------
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]


--
- Mark

http://www.lucidimagination.com




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Problem with CharStream and Tokenizers with custom reset(Reader) method

Jason Rutherglen
In reply to this post by Uwe Schindler
I've been seeing strange behavior perhaps related to this? Where
sometimes a query is parsed and analyzed using Solr analyzers to
it's first clause fairly randomly, and other times the same
exact query is parsed and analyzed to the full correct query with all
clauses. It's so baffling I haven't really figured out an
approach to debugging it. I wonder if it's related to this
stream resetting issue.

On Thu, Sep 10, 2009 at 7:54 AM, Uwe Schindler <[hidden email]> wrote:

> When reviewing the new CharStream code added to Tokenizers, I found a
> serious problem with backwards compatibility and other Tokenizers, that do
> not override reset(CharStream).
>
> The problem is, that e.g. CharTokenizer only overrides reset(Reader):
>
>  public void reset(Reader input) throws IOException {
>    super.reset(input);
>    bufferIndex = 0;
>    offset = 0;
>    dataLen = 0;
>  }
>
> If you reset such a Tokenizer with another CharStream (not a Reader), this
> method will never be called and breaking the whole Tokenizer.
>
> As CharStream extends Reader, I propose to remove this reset(CharStream
> method) and simply do an instanceof check to detect if the supplied Reader
> is no CharStream and wrap it. We could also remove the extra ctor (because
> most Tokenizers have no support for passing CharStreams). If the ctor also
> checks with instanceof and warps as needed the code is backwards compatible
> and we do not need to add additional ctors in subclasses.
>
> As this instanceof check is always done in CharReader.get() why not remove
> ctor(CharStream) and reset(CharStream) completely?
>
> Any thoughts?
>
> I would like to fix this somehow before RC4, I', sorry :(
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [hidden email]
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Problem with CharStream and Tokenizers with custom reset(Reader) method

Uwe Schindler
I do not know, how this could affect Solr, but it could be the case.
Currently most Tokenizers do not use CharStreams at all. After committing
LUCENE-1906, I think there is also some additional work in Solr's custom
Tokenizers needed (changed the correctOffset method).

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]

> From: Jason Rutherglen [mailto:[hidden email]]
> Sent: Friday, September 11, 2009 12:28 AM
> To: [hidden email]
> Subject: Re: Problem with CharStream and Tokenizers with custom
> reset(Reader) method
>
> I've been seeing strange behavior perhaps related to this? Where
> sometimes a query is parsed and analyzed using Solr analyzers to
> it's first clause fairly randomly, and other times the same
> exact query is parsed and analyzed to the full correct query with all
> clauses. It's so baffling I haven't really figured out an
> approach to debugging it. I wonder if it's related to this
> stream resetting issue.
>
> On Thu, Sep 10, 2009 at 7:54 AM, Uwe Schindler <[hidden email]> wrote:
> > When reviewing the new CharStream code added to Tokenizers, I found a
> > serious problem with backwards compatibility and other Tokenizers, that
> do
> > not override reset(CharStream).
> >
> > The problem is, that e.g. CharTokenizer only overrides reset(Reader):
> >
> >  public void reset(Reader input) throws IOException {
> >    super.reset(input);
> >    bufferIndex = 0;
> >    offset = 0;
> >    dataLen = 0;
> >  }
> >
> > If you reset such a Tokenizer with another CharStream (not a Reader),
> this
> > method will never be called and breaking the whole Tokenizer.
> >
> > As CharStream extends Reader, I propose to remove this reset(CharStream
> > method) and simply do an instanceof check to detect if the supplied
> Reader
> > is no CharStream and wrap it. We could also remove the extra ctor
> (because
> > most Tokenizers have no support for passing CharStreams). If the ctor
> also
> > checks with instanceof and warps as needed the code is backwards
> compatible
> > and we do not need to add additional ctors in subclasses.
> >
> > As this instanceof check is always done in CharReader.get() why not
> remove
> > ctor(CharStream) and reset(CharStream) completely?
> >
> > Any thoughts?
> >
> > I would like to fix this somehow before RC4, I', sorry :(
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: [hidden email]
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]