SortedSetDocValues vs BinaryDocValues



I need to add col1:Array[String], col2:Array[Int] and col3:Array[Float] to

col1: Array[String] is a sparse dimension, in OLAP terms.

col2: Array[Int] together with col3: Array[Float] represents a sparse vector
for a sparse measure, in OLAP terms, with the dictionary encoding of col1
mapped to col2.

I have a few options to implement it:

1. Use a SortedSetDocValuesField for each of them, with the String, Int and
Float values mapped to bytes

2. Generate a byte array from Array[String], Array[Int] and Array[Float] and
save it as a byte blob using BinaryDocValuesField
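
As a rough illustration of option 2, the sketch below packs the parallel Int and Float arrays into one byte array with `java.nio.ByteBuffer`. The class name `SparseVectorBlob` and the layout (count, then indices, then values) are assumptions made for this example and are not anything Lucene prescribes.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: packs a sparse vector (indices + values) into one blob.
final class SparseVectorBlob {

    // Layout (assumed for this sketch): [count:int][indices:int*count][values:float*count]
    static byte[] encode(int[] indices, float[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have equal length");
        }
        int n = indices.length;
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + n * (Integer.BYTES + Float.BYTES));
        buf.putInt(n);
        for (int i : indices) buf.putInt(i);
        for (float v : values) buf.putFloat(v);
        return buf.array();
    }

    // Reads back the index array from a blob produced by encode().
    static int[] decodeIndices(byte[] blob) {
        ByteBuffer buf = ByteBuffer.wrap(blob);
        int[] indices = new int[buf.getInt()];
        for (int i = 0; i < indices.length; i++) indices[i] = buf.getInt();
        return indices;
    }

    // Reads back the value array by skipping the count and the index section.
    static float[] decodeValues(byte[] blob) {
        ByteBuffer buf = ByteBuffer.wrap(blob);
        int n = buf.getInt();
        buf.position(Integer.BYTES + n * Integer.BYTES);
        float[] values = new float[n];
        for (int i = 0; i < n; i++) values[i] = buf.getFloat();
        return values;
    }
}
```

The resulting array could then be wrapped in a `BytesRef` and passed to a `BinaryDocValuesField`; the field name used for that would be application-specific.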

I know for sure that Array[Int] and Array[Float] will compress better if I
save them using a type-specific encoding, but I am unsure whether to use
option 1 or option 2 to implement the idea.
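
To make "specific encoding" concrete for the Array[Int] side, here is one possible sketch: delta encoding followed by a variable-length byte encoding. It assumes the indices are non-negative and sorted ascending, which the post does not state; the class name `DeltaVarInt` is hypothetical, and the float array is not covered.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical codec: delta + variable-length encoding for sorted int indices.
final class DeltaVarInt {

    // Encodes each value as the gap from its predecessor, 7 bits per byte.
    static byte[] encode(int[] sorted) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int v : sorted) {
            int delta = v - prev;
            if (delta < 0) {
                throw new IllegalArgumentException("input must be non-negative and sorted ascending");
            }
            writeVarInt(out, delta);
            prev = v;
        }
        return out.toByteArray();
    }

    // Low 7 bits first; the high bit of each byte marks "more bytes follow".
    private static void writeVarInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Decodes `count` values; the caller must store the count separately.
    static int[] decode(byte[] bytes, int count) {
        int[] result = new int[count];
        int pos = 0;
        int prev = 0;
        for (int i = 0; i < count; i++) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            result[i] = prev;
        }
        return result;
    }
}
```

With closely spaced indices most gaps fit in a single byte, so the encoded form is usually smaller than four bytes per Int. The actual saving depends on the data.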

Option 1 has a limitation on the number of bytes I can save, and I am not
sure whether pushing a Set to be serialized to disk is a good idea (I am not
sure yet whether a Set is actually serialized to disk; most likely not).
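
One point to weigh for option 1 is that a sorted-set doc values field holds, per document, a set of unique values in sorted order rather than the array as it was written. The sketch below uses a plain `java.util.TreeSet` as a stand-in to show the effect. It is an analogy only, not Lucene's storage code, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical illustration of sorted-set semantics using a JDK TreeSet.
final class SortedSetSemantics {

    // Returns the values as a sorted set would hold them: unique and ordered.
    static String[] asSortedSet(String[] values) {
        return new TreeSet<>(Arrays.asList(values)).toArray(new String[0]);
    }
}
```

For example, `asSortedSet(new String[]{"west", "east", "west"})` yields `east, west`: the duplicate is dropped and the original order is lost. If col2 and col3 depend on position to stay aligned, that is a reason to look at option 2.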

I am open to coming up with a specific encoding for the Array data type that
re-uses the String, Int and Float encodings we already have.

It would be great if the experts could provide some pointers on whether to
use SortedSetDocValues or to serialize/deserialize with BinaryDocValuesField.
The idea of sparse dimensions and measures comes from Oracle Essbase, and I
believe we may bring in tensors as well in the future.