faceting and categorizing on color?


faceting and categorizing on color?

James Pine
Hey Everyone,

I have been reading several threads about facet counts
and category counts and was wondering if/how they
might apply to searching for colors. Let's say that
there is a Lucene index where each document
corresponds to an image. In addition, each document
contains the top 10 most frequently occurring colors
(hex values) in its associated image along with some
other ancillary data. So a few example documents might
look like this:

Field Value
IMAGE 1
COLORS F00000 FF0000 FFF000 FFFF00 FFFFF0 FFFFFF
E00000 EE0000 EEE000 EEEE00
PIXELS 10000
FORMAT JPG

Field Value
IMAGE 2
COLORS F00000 F0F000 F0FF00 F0FFF0 F0FFFF FFFFFF
E00000 E0E000 E0EE00 E0EEE0
PIXELS 50000
FORMAT JPG

Field Value
IMAGE 3
COLORS F00000 F00F00 F00FF0 F00FFF FFFFFF E00000
E0E000 E0EE00 E0EEE0 E0EEEE
PIXELS 1000
FORMAT PNG

Field Value
IMAGE 4
COLORS F00000 F00F00 F00FF0 F00FFF CCCCCC E00000
E0E000 E0EE00 E0EEE0 E0EEEE
PIXELS 20000
FORMAT GIF

If I set up the COLORS field such that it is of type
Field.Keyword, then I can construct a query
COLORS:FFFFFF with a StandardAnalyzer that returns the
first 3 documents. The additional bit of information
I'd like to calculate/return is the list of colors the
images share, in order of most frequent to least
frequent, e.g.:

F00000, 3 occurrences
E00000, 3 occurrences
E0E000, 2 occurrences
E0EE00, 2 occurrences
FF0000, 1 occurrence
...
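Setting Lucene aside for a second, the rollup I'm after is just this in plain Java (quick sketch; the matching docs' color lists below are made up, not my real data):

```java
import java.util.*;

public class ColorRollup {
    // Count how often each color appears across the matching docs,
    // most frequent first -- the rollup described above.
    static List<Map.Entry<String, Integer>> rollup(List<String[]> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] colors : docs)
            for (String c : colors)
                counts.merge(c, 1, Integer::sum);
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> b.getValue() - a.getValue());
        return sorted;
    }

    public static void main(String[] args) {
        List<String[]> matches = Arrays.asList(
            new String[]{"F00000", "FF0000", "FFFFFF"},
            new String[]{"F00000", "E00000", "FFFFFF"},
            new String[]{"F00000", "E00000", "FFFFFF"});
        for (Map.Entry<String, Integer> e : rollup(matches))
            System.out.println(e.getKey() + ", " + e.getValue() + " occurrences");
    }
}
```

The counting itself is trivial; the real question is how to get those per-doc color lists out of the index cheaply.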

I could use a custom HitCollector to collect and order
this information, but it would cause a sizable
performance hit. This thread from last year:
http://www.nabble.com/forum/ViewTopic.jtp?topic=266441&tview=threaded

It seems to suggest creating a String[] of colors,
then iterating over it with a QueryFilter, ANDing
the result set's bits with the filtered bits, and sorting
based on cardinality. That works in my tests with a
limited number of colors, but my String[] of colors
could contain 16.7M values, and most certainly will
contain at least several million.

I read up on BitSets and Solr's use of them as well in
this thread:
http://www.gossamer-threads.com/lists/lucene/java-user/35427?do=post_view_threaded

But I think the solutions suggested assume
several hundred facets, e.g. price, manufacturer, size,
etc., each one having a few hundred possible values.
Those would make sense so that I could provide
filtering based on FORMAT or number of PIXELS, but
again, don't seem to scale in the colors example.

Am I mistaken in thinking that my problem is similar
to faceting/categorization/rollup, just on a different
scale? I suppose I could also structure my document
such that it looked like this:

Field Value
IMAGE 1
COLOR F00000
COLOR FF0000
COLOR FFF000
COLOR FFFF00
COLOR FFFFF0
COLOR FFFFFF
COLOR E00000
COLOR EE0000
COLOR EEE000
COLOR EEEE00
PIXELS 10000
FORMAT JPG

That would allow me to do something with TermEnums,
but I don't see how it would be any better than using
a HitCollector. Does the above document structure make
things any better/worse/the same compared to the
initial document structure? If the answer to my
question is trivial, would anything change if I wanted
to weight the colors? For example, if the initial
document looked like:

Field Value
IMAGE 1
COLORS 10-F00000 6-FF0000 3-FFF000 9-FFFF00 7-FFFFF0
1-FFFFFF 2-E00000 4-EE0000 8-EEE000 5-EEEE00
PIXELS 10000
FORMAT JPG

That would be interpreted as FFFFFF is the most
dominant of the top 10 most dominant colors and F00000
is the least. Would that be a terrible document
structure?

Thanx in advance, especially for wading through this
entire post :o)

JAMES


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: faceting and categorizing on color?

Chris Hostetter-3

: I have been reading several threads about facet counts
: and category counts and was wondering if/how they
: might apply to searching for colors. Let's say that

First off, let me clear up something regarding your index field structure.
You mentioned that you currently have documents that look like this...

: IMAGE 1
: COLORS F00000 FF0000 FFF000 FFFF00 FFFFF0 FFFFFF
: E00000 EE0000 EEE000 EEEE00

        ...

: If I setup the COLORS field such that it is of type
: Field.Keyword, then I can construct a query
: COLORS:FFFFFF with a StandardAnalyzer that returns the

If you are indexing it as Field.Keyword and you can query for
COLORS:FFFFFF and get results, then either you are only getting
documents that are 100% white, or when you indexed the Documents you added
each color as a separate field instance -- I'm going to assume it's the
latter, which makes perfect sense and is a reasonable way to do things, but
it's also exactly the same as what you ask about a little later...

: Am I mistaken in thinking that my problem is similar
: to faceting/categorization/rollup, just on a different
: scale? I suppose I could also structure my document
: such that it looked like this:
:
: Field Value
: IMAGE 1
: COLOR F00000
: COLOR FF0000

...unless of course, I'm completely misunderstanding you.

On to the meat of your question...

: performance hit. This thread from last year:
: http://www.nabble.com/forum/ViewTopic.jtp?topic=266441&tview=threaded
:
: It seems to suggest creating a String[] of colors,
: then iterating over it with a QueryFilter, ANDing
: the result set's bits with the filtered bits, and sorting
: based on cardinality. That works in my tests with a
: limited number of colors, but my String[] of colors
: could contain 16.7M values, and most certainly will
: contain at least several million.

Actually, the idea behind that approach is that you *want* to build up a
QueryFilter (or some other BitSet-ish thing) for every possible term value,
even if most of the time you won't be looking at all of them,
because you can reuse a different subset of all of them on every query --
it's a time/space tradeoff: really fast set lookups/intersections in
exchange for a lot of RAM.
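In java.util.BitSet terms, the per-query work for one facet value is just an AND plus a cardinality() call (sketch; the doc numbers are made up):

```java
import java.util.BitSet;

public class FacetIntersect {
    // Given the bits for the main query result and the cached bits for
    // one facet value, the facet count is the cardinality of the AND.
    static int facetCount(BitSet queryResults, BitSet facetBits) {
        BitSet intersection = (BitSet) queryResults.clone(); // don't clobber the cache
        intersection.and(facetBits);
        return intersection.cardinality();
    }

    public static void main(String[] args) {
        BitSet query = new BitSet();   // docs 0,1,2 match the query
        query.set(0, 3);
        BitSet white = new BitSet();   // docs 1,2,5 contain FFFFFF
        white.set(1); white.set(2); white.set(5);
        System.out.println(facetCount(query, white)); // prints 2
    }
}
```

The RAM cost is in keeping one cached BitSet per term value; the per-query intersections themselves are very cheap.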

: Am I mistaken in thinking that my problem is similar
: to faceting/categorization/rollup, just on a different

You're not mistaken.  One thing you may want to consider is whether or not
it's really useful to provide "facet" counts for each of the unique
color values in the 256^3 RGB spectrum, or if it's better (and easier) to
facet on the most significant bytes of the color value.  Searching for
"white" and seeing...

     FFFFFF  22,143 photos
     FFFFFE  22,134 photos
     FEFFFF  22,121 photos
     FFFEFF  22,098 photos
     FEFFFE  22,088 photos

...as the top 5 common colors in my results isn't all that helpful, is it?

If you index both the full color codes as well as just the most
significant hex characters from the RGB code in separate fields, you can
do "coarse" faceting on one field, and "refined" faceting on the other...

IMAGE:  1
COLOR: FFFFFE FFA3B4 2287C3 773442 666BED
COARSE_COLOR: FFF FAB 28C 734 66E
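Deriving the coarse code is just keeping the most significant hex digit of each R/G/B byte (quick sketch):

```java
public class CoarseColor {
    // "FFA3B4" -> "FAB": keep the most significant hex digit of each
    // of the R, G, and B bytes.
    static String coarse(String rgb) {
        return new String(new char[]{rgb.charAt(0), rgb.charAt(2), rgb.charAt(4)});
    }

    public static void main(String[] args) {
        System.out.println(coarse("FFA3B4")); // prints FAB
    }
}
```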

: That would allow me to do something with TermEnums,
: but I don't see how it would be any better than using
: a HitCollector. Does the above document structure make

Usually, you need to do one pass over your values and one pass over
your documents -- with a HitCollector you do a pass over your values first
(either by constructing a FieldCache or by doing something fancy with
TermEnum).  Alternately, you can do one pass over your documents, building
a BitSet of the docs that match, and *then* do a pass over the values
using a TermEnum/TermDocs to get interesting stats about them.

If you are interested in doing the coarse approach, the QueryFilter/BitSet
approach would probably work very well (only 4096 total BitSets ever).
Once a coarse color is chosen, you can use the TermEnum to get a list of
all the subtle shades and TermDocs to check which docs already in your set
match on that shade.

If you store the least significant hex characters last, you can reduce
the amount of skipping around you have to do...

IMAGE:  1
COLOR: FFFFFE FFA3B4 2587C3 773442 666BED
COARSE_COLOR: FFF FAB 28C 734 66E
REFINE_COLOR: FFFFFE FABF34 28C573 734742 66E6BD
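i.e. the "refined" term is just the same six hex digits rearranged, coarse half first, so all the shades of one coarse color sort together in the term dictionary (sketch):

```java
public class RefineColor {
    // Rearrange "FFA3B4" into "FABF34": the three most significant hex
    // digits first, then the three least significant, so terms sharing
    // a coarse color are adjacent in the term dictionary.
    static String refine(String rgb) {
        return new String(new char[]{
            rgb.charAt(0), rgb.charAt(2), rgb.charAt(4),   // coarse half
            rgb.charAt(1), rgb.charAt(3), rgb.charAt(5)}); // fine half
    }

    public static void main(String[] args) {
        System.out.println(refine("FFA3B4")); // prints FABF34
    }
}
```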

: initial document structure? If the answer to my
: question is trivial would anything change if I wanted
: to weight the colors. For example, if the initial
: document looked like:
:
: Field Value
: IMAGE 1
: COLORS 10-F00000 6-FF0000 3-FFF000 9-FFFF00 7-FFFFF0
: 1-FFFFFF 2-E00000 4-EE0000 8-EEE000 5-EEEE00
: PIXELS 10000
: FORMAT JPG
:
: That would be interpreted as FFFFFF is the most
: dominant of the top 10 most dominant colors and F00000
: is the least. Would that be a terrible document
: structure?

Things certainly get a lot harder when you want to weight them ... but not
impossible.  Using term frequencies would probably be the best way to do
this (ie: add COLORS:F00000 to the document 10 times, and COLORS:FFFFFF only
once) and would have the added benefit that this document would score better
on a search for COLORS:F00000 than it would on a search for COLORS:FFFFFF.




-Hoss




Re: faceting and categorizing on color?

James Pine
> First off, let me clear up something regarding your
> index field structure.
> You mentioned that you currently have documents that
> look like this...
>
> : IMAGE 1
> : COLORS F00000 FF0000 FFF000 FFFF00 FFFFF0 FFFFFF
> : E00000 EE0000 EEE000 EEEE00
>
> If you are indexing it as Field.Keyword and you can
> query for
> COLORS:FFFFFF and get results, then either you are
> only getting
> documents that are 100% white, or when you indexed
> the Documents you added
> each color as a separate field instance

I thought that having: F00000 FF0000 FFFFFF FFFF00...
in one field and then searching for FFFFFF in it would
match all documents that contain that "word" so I
would be finding documents that could be < 100% white.
In either case, I don't actually have a document
structure yet ;o) It seems though that you are
recommending this document structure:

Field Value
IMAGE 1
COLOR FF0000
COLOR 00FF00
COLOR 0000FF
COLOR 808080


And then if this image were 50% Red, 25% Green, 12.5%
Blue and 12.5% Grey the document would look like this:

Field Value
IMAGE 1
COLOR FF0000
COLOR FF0000
COLOR FF0000
COLOR FF0000
COLOR 00FF00
COLOR 00FF00
COLOR 0000FF
COLOR 808080

Is that correct? I wrote a unit test to compare term
document frequencies while iterating through the
TermEnum object returned by IndexReader.terms() and
the counts were equal. I guess I am still not clear on
what the differences/advantages/disadvantages are
between the above document and one that looks like this:

Field Value
IMAGE 1
COLORS FF0000 FF0000 FF0000 FF0000 00FF00 00FF00
0000FF 808080

In fact with all the color values stored in a single
field, if I did some sort of pixel region
mapping/resolution reduction then wouldn't it lend
itself to Span/Fuzzy queries so I could not only
search for images which have both red and green, but
rank them based on how close the colors are to each
other? Would I lose that capability if I stored each
color in its own field? So perhaps a more general
question is when is it better to collapse a bunch of
words into a single field vs. spread them out over
many fields, all of which have the same field name?

> If you index both the full color codes as well as
> just the most
> significant hex characters from the RGB code in
> seperate fields, you can
> do "coarse" faceting on one field, and "refined"
> faceting on the other...
>
> IMAGE:  1
> COLOR: FFFFFE FFA3B4 2287C3 773442 666BED
> COARSE_COLOR: FFF FAB 28C 734 66E

That's a great idea and seems way more straightforward
than the RGB/HSV etc. distance calculation algorithms
I've been reading about :o) I'll have to run some
tests to see how accurate that reduction appears to
people.

> Usually, you need to do one pass over your values
> and one pass over
> your documents -- with a HitCollector you do a pass
> over your values first
> Alternately, you can do one pass over
> your documents building
> a BitSet of the docs that match, and *then* do a
> pass over the values

Huh, perhaps I don't understand the HitCollector
fully. Are you saying that if I have an index with 100
documents, each of which have a color field (let's say
25 of the documents have FFFFFF in the color field)
and I do a search for FFFFFF...using a HitCollector
I'm iterating over all 100 documents while extracting
values whereas with the regular Hits based search I
would only be iterating over 25?

Thanx again.

JAMES




Re: faceting and categorizing on color?

Chris Hostetter-3

: I thought that having: F00000 FF0000 FFFFFF FFFF00...
: in one field and then searching for FFFFFF in it would
: match all documents that contain that "word" so I
        ...
: the counts were equal. I guess I am still not clear on
: what the differences/advantages/disadvantages are
: between the above document and one that looks like this:
        ...
: color in its own field? So perhaps a more general
: question is when is it better to collapse a bunch of
: words into a single field vs. spread them out over
: many fields, all of which have the same field name?

My point was that (depending on what exactly you were trying to express)
when you try to represent a document's structure in email like that, both
examples have the exact same structure in the index.  For any given field
name, there is only one Field in the index, and it has a stream of values
regardless of how you constructed your Document object -- technically
there isn't even really a "field" so much as there is an ordered list of
Terms.  Specifically, this...

  Directory d = new RAMDirectory();  // any Directory implementation will do
  IndexWriter w = new IndexWriter(d, new WhitespaceAnalyzer(), true);
  Document doc = new Document();
  doc.add(new Field("color", "red blue green", Store.NO, Index.TOKENIZED));
  w.addDocument(doc);

...is going to produce an index with the exact same terms mapped to that
document as this...

  IndexWriter w = new IndexWriter(d, new WhitespaceAnalyzer(), true);
  Document doc = new Document();
  doc.add(new Field("color", "red", Store.NO, Index.TOKENIZED));
  doc.add(new Field("color", "blue", Store.NO, Index.TOKENIZED));
  doc.add(new Field("color", "green", Store.NO, Index.TOKENIZED));
  w.addDocument(doc);

...the only difference that might pop up is if you configure your
analyzer to have a positionIncrementGap greater than 0, in which case
there will be a bigger "gap" between the red and blue and green in the
second case than in the first (which will affect any sloppy searches that
you might do).

LIA has a lot of great info on exactly how this works, and can really help
you "think" in terms of Terms.

: That's a great idea and seems way more straightforward
: than the RGB/HSV etc. distance calculation algorithms
: I've been reading about :o) I'll have to run some
: tests to see how accurate that reduction appears to
: people.

That's just an easy one if you are already representing your colors as RGB
hex codes -- the bottom line is that any method you have for simplifying your
palette will make it easier to do facet counts (because you are reducing the
number of facets).

: Huh, perhaps I don't understand the HitCollector
: fully. Are you saying that if I have an index with 100
: documents, each of which have a color field (let's say
: 25 of the documents have FFFFFF in the color field)
: and I do a search for FFFFFF...using a HitCollector
: I'm iterating over all 100 documents while extracting
: values whereas with the regular Hits based search I
: would only be iterating over 25?

First off, forget the Hits class exists -- it has no place in a discussion
of facet counts where you need to look at every document that matches a
given query.

Second: my point is that when you are starting with an arbitrary query,
and then you want to provide counts for a bunch of facet values (ie:
colors) you have no idea which facet values are the "best" -- you have no
way of knowing that you should start with FFFFFF, you have to try them
all...

Conceptually, facets are about set intersections and the cardinality of
those intersections.  You have a main result set of documents which match
your user's query, and you have many hypothetical sets of documents that
each represent all docs containing some single color that defines that
set.  You want to intersect each of those sets with your main result set,
and find out which colors produce the greatest intersection cardinality.

With the BitSet approach, you can directly implement this conceptual
model -- making real sets out of your hypothetical sets.  But if that's
not feasible from a memory usage perspective, then you can traverse the
two dimensions of the problem space (colors and documents) in either
order:
 1) you can use the FieldCache (or some other data structure you build
from TermEnums) to have fast access to the list of colors in each doc, and
then in your HitCollector, as you collect each docId matching your primary
query, increment a counter for the corresponding colors.
 2) you can start with a HitCollector that just records all of the docIds
that match your primary query -- a BitSet is convenient for this -- and
then use a TermEnum with a TermDocs to iterate over every color in your
index, and count up how many of the documents those colors map to are in
your BitSet.
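Option 2 sketched with plain java.util types (the postings Map here is a toy stand-in for the TermEnum/TermDocs pass, not real Lucene):

```java
import java.util.*;

public class TwoPassFacets {
    // Pass 1 has already filled 'matches' with the docIds that hit the
    // primary query.  Pass 2 walks every term's postings (the toy Map
    // stands in for TermEnum/TermDocs) and counts the overlap.
    static Map<String, Integer> countFacets(BitSet matches,
                                            Map<String, int[]> postings) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, int[]> term : postings.entrySet()) {
            int n = 0;
            for (int docId : term.getValue())
                if (matches.get(docId)) n++;
            if (n > 0) counts.put(term.getKey(), n);
        }
        return counts;
    }

    public static void main(String[] args) {
        BitSet matches = new BitSet();
        matches.set(0); matches.set(2);               // docs 0 and 2 matched
        Map<String, int[]> postings = new HashMap<>();
        postings.put("F00000", new int[]{0, 1, 2});   // color -> docIds
        postings.put("FFFFFF", new int[]{1});
        System.out.println(countFacets(matches, postings)); // prints {F00000=2}
    }
}
```

The per-term work is proportional to that term's document frequency, which is why storing the coarse digits first (so related terms are adjacent) reduces the skipping around.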

(NOTE: using a FieldCache for this is non-trivial, since FieldCache only
works if each doc has at most one value -- I wasn't really thinking about
the specifics of your color example when I mentioned it before -- but the
concept is still the same: an array that allows fast lookup by docId, you
could make your own array containing the arrays of values just as easily
as the FieldCache)





-Hoss

