mixing dv types in one IW session

mixing dv types in one IW session

Robert Muir
Hello,

Just playing around, I wanted to see what happens if you
change the docvalues type of a field across different documents in a
single IW session, e.g.

case 1:
doc1.add(new IntDocValuesField("foo", 5))
doc2.add(new FloatDocValuesField("foo", 2.5f))

in this case the 2.5f is truncated to an int and becomes a 2

case 2:
doc3.add(new StraightBytesDocValuesField("foo", new BytesRef("boo!")))

in this case you hit an NPE in IntsWriter, because the straightbytes
impl naturally cannot return an intvalue.
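
A toy model of the two cases, for illustration only. `ToyIntsBuffer` is a hypothetical stand-in and not Lucene's IntsWriter; it simply assumes that an integer-typed buffer reads every incoming value through `Number.longValue()`, which is enough to reproduce both a silent truncation and an NPE.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an integer-typed docvalues buffer (not Lucene code).
final class ToyIntsBuffer {
    private final List<Long> values = new ArrayList<>();

    // Assumption: every value is read as an integer, whatever field type produced it.
    void add(Number numericValue) {
        // Float.longValue() truncates toward zero, so 2.5f is stored as 2.
        // A null (e.g. a bytes-only field with no numeric value) throws an NPE here.
        values.add(numericValue.longValue());
    }

    long get(int doc) {
        return values.get(doc);
    }
}
```

Adding `5` and then `2.5f` stores 5 and 2; passing `null`, as a bytes-only value might, fails with a NullPointerException.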

So I'm wondering what we should do?
Currently both merging and MultiDocValues do type promotion, but if
the type changes within the same IW session, no promotion happens.

idea 1: throw an exception if the type is changed in one session. this
leaves things a little inconsistent, but prevents strange results.
idea 2: throw an exception if the type is changed *and also on
merge/multidocvalues*. This seems a little cruel (no way to upgrade
your short to int if you need later) but would simplify some code.
(evil) idea 3: force a flush if the type is changed and let merging
take care of it.
idea 4: buffer docvalues in RAM in IW instead of inside the codec, in
a "type-independent way" (e.g. sorted hash of the unique byte values +
per-doc ords). This is a lot of work, but would make the codec side of
DV simpler, as it would just do encode/decode and wouldn't have to do RAM
accounting or deal with types changing or any of that.
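
One possible reading of idea 4, as a minimal sketch. `ToyDocValuesBuffer` and its methods are hypothetical names; the sketch assumes each value arrives already encoded as bytes, keeps the unique values in a hash, and only computes sorted per-doc ords at flush time.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical type-independent RAM buffer: unique encoded values plus one id per document.
final class ToyDocValuesBuffer {
    private final Map<String, Integer> idByValue = new HashMap<>(); // unique value -> insertion id
    private final List<String> valueById = new ArrayList<>();
    private final List<Integer> idPerDoc = new ArrayList<>();

    void add(byte[] encoded) {
        // ISO-8859-1 maps each byte to exactly one char, so String order equals unsigned byte order.
        String key = new String(encoded, StandardCharsets.ISO_8859_1);
        Integer id = idByValue.get(key);
        if (id == null) {
            id = valueById.size();
            idByValue.put(key, id);
            valueById.add(key);
        }
        idPerDoc.add(id);
    }

    int uniqueCount() {
        return valueById.size();
    }

    // At flush: sort the unique values and remap each document's id to its sorted ord.
    int[] sortedOrds() {
        List<String> sorted = new ArrayList<>(valueById);
        Collections.sort(sorted);
        int[] ordById = new int[sorted.size()];
        for (int ord = 0; ord < sorted.size(); ord++) {
            ordById[idByValue.get(sorted.get(ord))] = ord;
        }
        int[] ords = new int[idPerDoc.size()];
        for (int doc = 0; doc < ords.length; doc++) {
            ords[doc] = ordById[idPerDoc.get(doc)];
        }
        return ords;
    }
}
```

In this sketch the buffer never needs to know whether the bytes came from an int, a float, or raw bytes; deciding how to encode them would be left to whatever consumes the sorted values and ords.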

any other ideas?


--
lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: mixing dv types in one IW session

Michael McCandless-2
I agree this inconsistency is bad... and silently losing stuff (float
2.5 becomes int 2) is really bad.  We should do something before 4.0.

I would prefer idea 2, i.e. that we never allow changing/promoting a DV
type for a given field, and that we do our best to throw a clear exception
if you do so. I realize this is different from other things in Lucene where
"anything goes", but DV is new in 4.0 so we are free to set new rules.

Also, if this somehow later proves to be a bad decision, we can always
add back in this leniency ... but not vice-versa.

Mike McCandless

http://blog.mikemccandless.com

On Mon, May 28, 2012 at 3:53 PM, Robert Muir <[hidden email]> wrote:

> Hello,
>
> Just doing some playing around, i wanted to see what happens if you
> changeup a docvalues type across different documents in a single IW
> session, e.g.
>
> case 1:
> doc1.add(new IntDocValuesField("foo", 5))
> doc2.add(new FloatDocValuesField("foo", 2.5f))
>
> in this case the 2.5f is truncated to an int and becomes a 2
>
> case 2:
> doc3.add(new StraightBytesDocValuesField("foo", new BytesRef("boo!"))
>
> in this case you hit an NPE in IntsWriter, because the straightbytes
> impl naturally cannot return an intvalue.
>
> So I'm wondering what we should do?
> Currently both merging and multidocvalues do a type-promotion, but if
> it happens in the same iw session this won't happen.
>
> idea 1: throw an exception if the type is changed in one session. this
> leaves things a little inconsistent, but prevents strange results.
> idea 2: throw an exception if the type is changed *and also on
> merge/multidocvalues*. This seems a little cruel (no way to upgrade
> your short to int if you need later) but would simplify some code.
> (evil) idea 3: force a flush if the type is changed and let merging
> take care of it.
> idea 4: buffer docvalues in ram in IW instead of inside the codec, in
> a "type-independent way" (e.g. sorted hash of the unique byte values +
> per-doc ords). this is a lot of work, but would make the codec side of
> DV simpler as it just does encode/decode and wouldnt have to do ram
> accounting or deal with types changing or any of that.
>
> any other ideas?
>
>

Re: mixing dv types in one IW session

Robert Muir
On Tue, May 29, 2012 at 12:20 PM, Michael McCandless
<[hidden email]> wrote:

> I agree this inconsistency is bad... and silently losing stuff (float
> 2.5 becomes int 2) is really bad.  We should do something before 4.0.
>
> I would prefer idea 2, i.e. that we never allow changing/promoting a DV
> type for a given field, and that we do our best to throw clear exc if you
> do so. I realize this is different from other things in Lucene where "anything
> goes" but DV is new in 4.0 so we are free to set new rules.
>
> Also, if this somehow later proves to be a bad decision, we can always
> add back in this leniency ... but not vice-versa.

Right, this would certainly simplify things, but as I mentioned it's a
little cruel if someone is using a 16-bit type (for DV or norms) and
decides they are running out of space and need 32 bits or something.

Maybe I'm worrying about it too much?

One idea would be to move this type promotion to a
FilterIndexReader+AddIndexes tool in contrib, e.g. a general tool that
can upwards cast a norm or dv type?

I've thought about this before: maybe having a nice tool to change
these types of things, e.g. completely remove a field, or unomit norms
(but you must specify a default value), or other crazy things like that?
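
The core of such an upward cast is small. The sketch below is hypothetical and uses no Lucene APIs (`ToyUpcast.widen` is an invented name); it only shows that widening 16-bit values to 32 bits is lossless, which is what would make a one-off conversion tool safe.

```java
// Hypothetical helper showing the value conversion an "upcast" tool would perform.
final class ToyUpcast {
    // Widening short -> int is lossless: every 16-bit value has an exact 32-bit representation.
    static int[] widen(short[] values16) {
        int[] values32 = new int[values16.length];
        for (int i = 0; i < values16.length; i++) {
            values32[i] = values16[i]; // implicit widening primitive conversion
        }
        return values32;
    }
}
```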

--
lucidimagination.com

Re: mixing dv types in one IW session

Michael McCandless-2
On Tue, May 29, 2012 at 12:31 PM, Robert Muir <[hidden email]> wrote:

> On Tue, May 29, 2012 at 12:20 PM, Michael McCandless
> <[hidden email]> wrote:
>> I agree this inconsistency is bad... and silently losing stuff (float
>> 2.5 becomes int 2) is really bad.  We should do something before 4.0.
>>
>> I would prefer idea 2, i.e. that we never allow changing/promoting a DV
>> type for a given field, and that we do our best to throw clear exc if you
>> do so. I realize this is different from other things in Lucene where "anything
>> goes" but DV is new in 4.0 so we are free to set new rules.
>>
>> Also, if this somehow later proves to be a bad decision, we can always
>> add back in this leniency ... but not vice-versa.
>
> Right, this would certainly simplify things: but as I mentioned its a
> little cruel if someone is using a 16-bit type (for DV or norms) and
> decides they are running out of space and need 32-bits or something.
>
> Maybe i'm worrying about it too much?

I think we can wait and see how many users complain about it ... I
suspect users that change up the bit width of their norms are rather
advanced and can handle re-indexing.

> One idea would be to move this type promotion to a
> FilterIndexReader+AddIndexes tool in contrib, e.g. a general tool that
> can upwards cast a norm or dv type?
>
> Ive thought about this before: maybe having a nice tool to change
> these types of things, e.g. completely remove a field, or unomit norms
> (but you must specify default value), or other crazy things like that?

I like that idea!  This way there's an "out" for such users...

Mike McCandless

http://blog.mikemccandless.com

Re: mixing dv types in one IW session

Simon Willnauer
On Tue, May 29, 2012 at 6:38 PM, Michael McCandless
<[hidden email]> wrote:

> On Tue, May 29, 2012 at 12:31 PM, Robert Muir <[hidden email]> wrote:
>> On Tue, May 29, 2012 at 12:20 PM, Michael McCandless
>> <[hidden email]> wrote:
>>> I agree this inconsistency is bad... and silently losing stuff (float
>>> 2.5 becomes int 2) is really bad.  We should do something before 4.0.
>>>
>>> I would prefer idea 2, i.e. that we never allow changing/promoting a DV
>>> type for a given field, and that we do our best to throw clear exc if you
>>> do so. I realize this is different from other things in Lucene where "anything
>>> goes" but DV is new in 4.0 so we are free to set new rules.

+1. I think we can easily build a "global" view of DV types, like we do
for field numbers, and be consistent across DWPTs.
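
A minimal sketch of what such a global view could look like, assuming a single registry shared by all indexing threads. `ToyGlobalDocValuesTypes` and its `DvType` enum are hypothetical names, not Lucene classes; the first thread to register a field fixes its type, and any later mismatch gets a clear exception.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical registry shared by all indexing threads (not Lucene code).
final class ToyGlobalDocValuesTypes {
    enum DvType { INT, FLOAT, BYTES } // illustrative subset of types

    private final ConcurrentMap<String, DvType> typeByField = new ConcurrentHashMap<>();

    // The first caller for a field fixes its type; a later mismatch is rejected.
    void checkOrRegister(String field, DvType type) {
        DvType previous = typeByField.putIfAbsent(field, type);
        if (previous != null && previous != type) {
            throw new IllegalArgumentException(
                "cannot change docvalues type of field \"" + field
                    + "\" from " + previous + " to " + type);
        }
    }
}
```

`putIfAbsent` is atomic, so two threads racing to register different types for the same field cannot both succeed.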

>>>
>>> Also, if this somehow later proves to be a bad decision, we can always
>>> add back in this leniency ... but not vice-versa.
>>
>> Right, this would certainly simplify things: but as I mentioned its a
>> little cruel if someone is using a 16-bit type (for DV or norms) and
>> decides they are running out of space and need 32-bits or something.
>>
>> Maybe i'm worrying about it too much?
>
> I think we can wait and see how many users complain about it ... I
> suspect users that change up the bit width of their norms are rather
> advanced and can handle re-indexing.
>
>> One idea would be to move this type promotion to a
>> FilterIndexReader+AddIndexes tool in contrib, e.g. a general tool that
>> can upwards cast a norm or dv type?
>>
>> Ive thought about this before: maybe having a nice tool to change
>> these types of things, e.g. completely remove a field, or unomit norms
>> (but you must specify default value), or other crazy things like that?
>
> I like that idea!  This way there's an "out" for such users...

+1 as well

simon
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>

Re: mixing dv types in one IW session

Robert Muir
On Tue, May 29, 2012 at 2:29 PM, Simon Willnauer
<[hidden email]> wrote:
> +1 I think we can easily build a "global" view of DV types like we do
> for field numbers and be consistent across DWPT

If we do this, I would rather extend the field numbers and:
* throw an exception if indexOptions conflict (e.g. omitTF=true versus
false) instead of silently dropping positions on merge
* same with omitNorms
* same with norms types and docvalues types
* still keep field numbers consistent

This way I think we could eliminate all these traps and just give an
exception instead.
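
The list above could be sketched roughly as follows. Everything here is hypothetical (`ToyFieldSchemas`, `Schema`, and the string-typed settings are invented for illustration), and the sketch assumes that any difference in a field's settings counts as a conflict.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical extension of a field-number map that also pins each field's settings.
final class ToyFieldSchemas {
    static final class Schema {
        final boolean omitTF;
        final boolean omitNorms;
        final String normsType;     // null means "no norms"
        final String docValuesType; // null means "no docvalues"

        Schema(boolean omitTF, boolean omitNorms, String normsType, String docValuesType) {
            this.omitTF = omitTF;
            this.omitNorms = omitNorms;
            this.normsType = normsType;
            this.docValuesType = docValuesType;
        }

        // Name of the first setting that differs from other, or null if all match.
        String firstConflict(Schema other) {
            if (omitTF != other.omitTF) return "omitTF";
            if (omitNorms != other.omitNorms) return "omitNorms";
            if (!Objects.equals(normsType, other.normsType)) return "norms type";
            if (!Objects.equals(docValuesType, other.docValuesType)) return "docvalues type";
            return null;
        }
    }

    private final Map<String, Integer> numberByField = new HashMap<>();
    private final Map<String, Schema> schemaByField = new HashMap<>();

    // Returns the field's stable number, or throws if the settings conflict with earlier ones.
    synchronized int addOrGet(String field, Schema schema) {
        Schema existing = schemaByField.get(field);
        if (existing == null) {
            schemaByField.put(field, schema);
            numberByField.put(field, numberByField.size());
        } else {
            String conflict = existing.firstConflict(schema);
            if (conflict != null) {
                throw new IllegalArgumentException(
                    "field \"" + field + "\": conflicting " + conflict);
            }
        }
        return numberByField.get(field);
    }
}
```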

--
lucidimagination.com

Re: mixing dv types in one IW session

Michael McCandless-2
On Tue, May 29, 2012 at 3:00 PM, Robert Muir <[hidden email]> wrote:

> On Tue, May 29, 2012 at 2:29 PM, Simon Willnauer
> <[hidden email]> wrote:
>> +1 I think we can easily build a "global" view of DV types like we do
>> for field numbers and be consistent across DWPT
>
> if we do this, i would rather extend the fieldnumbers and:
> * throw exception if indexOptions conflict (e.g. omitTF=true versus
> false) instead of silently dropping positions on merge
> * same with omitNorms
> * same with norms types and docvalues types
> * still keeping field numbers consistent
>
> this way i think we could eliminate all these traps and just give an
> exception instead.

+1

Mike McCandless

http://blog.mikemccandless.com

Re: mixing dv types in one IW session

Simon Willnauer
On Tue, May 29, 2012 at 9:03 PM, Michael McCandless
<[hidden email]> wrote:

> On Tue, May 29, 2012 at 3:00 PM, Robert Muir <[hidden email]> wrote:
>> On Tue, May 29, 2012 at 2:29 PM, Simon Willnauer
>> <[hidden email]> wrote:
>>> +1 I think we can easily build a "global" view of DV types like we do
>>> for field numbers and be consistent across DWPT
>>
>> if we do this, i would rather extend the fieldnumbers and:
>> * throw exception if indexOptions conflict (e.g. omitTF=true versus
>> false) instead of silently dropping positions on merge
>> * same with omitNorms
>> * same with norms types and docvalues types
>> * still keeping field numbers consistent
>>
>> this way i think we could eliminate all these traps and just give an
>> exception instead.

I will create an issue for this! This makes perfect sense and removes
all those traps.

simon

>
> +1
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>