Color search


Color search

Guangwei Yuan
Hi,

We're running an e-commerce site that provides product search. We've been
able to extract colors from product images, and we think it'd be cool and
useful to search products by color. A product image can have up to 5 colors
(from a color space of about 100 colors), so we can implement it easily with
Solr's facet search (thanks all who've developed Solr).

The problem arises when we try to sort the results by color relevancy.
What's different from a normal facet search is that colors are weighted. For
example, a black dress can be 70% black, 20% gray, and 10% brown. A
search query "color:black" should return results in which the black dress
ranks higher than other products with a lower percentage of black.

My question is: how do I configure and index the color field so that products
with a higher percentage of color X rank higher for the query "color:X"?

Thanks for your help!

- Guangwei

Re: Color search

Yonik Seeley-2
If it were just a couple of colors, you could have a separate field
for each color and then index the percent in that field.

black:70
grey:20

and then you could use a function query to influence the score (or you
could sort by the color percent).

However, this doesn't scale well to a large index with a large number of colors.
Each field used like that will take up 4 bytes per document in the index.

So if you have 1M documents, that's 1M docs * 100 colors * 4 bytes = 400MB.
Doable depending on your index size (use "int" or "float" and not
"sint" or "sfloat" type for this... it will be better on the memory).

If you needed to be better on the memory, you could encode all of the
colors into a single value (perhaps into a compact string... one
percentage per byte or something) and then have a custom function that
extracts the value for a particular color.  (This involves some Java
development.)
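
A minimal sketch of the single-value encoding described above, in plain Java with no Solr or Lucene dependency. It assumes each color has a fixed index in the palette and stores one percentage per byte at that position. The class and method names are hypothetical, and wiring the lookup into a Solr function query is not shown.

```java
import java.util.Map;

// Hypothetical helper: packs per-color percentages into one byte per palette entry.
final class PackedColors {
    /** Position i of the result holds the percentage (0-100) of palette color i. */
    static byte[] encode(Map<Integer, Integer> percentByColorIndex, int paletteSize) {
        byte[] packed = new byte[paletteSize];
        for (Map.Entry<Integer, Integer> e : percentByColorIndex.entrySet()) {
            int pct = e.getValue();
            if (pct < 0 || pct > 100) {
                throw new IllegalArgumentException("percentage out of range: " + pct);
            }
            packed[e.getKey()] = (byte) pct;
        }
        return packed;
    }

    /** Extracts the percentage stored for one palette color; 0 if the color is absent. */
    static int percentOf(byte[] packed, int colorIndex) {
        return packed[colorIndex] & 0xff;
    }
}
```

With a palette of 100 colors, the raw values take 100 bytes per document instead of the 400 bytes needed for 100 four-byte fields.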

-Yonik



Re: Color search

Grant Ingersoll-2
Another option would be to extend Solr (and donate back) to
incorporate Lucene's payload functionality, in which case you could
associate the percentage of the color as a payload and use the
BoostingTermQuery... :-)  If you're interested in this, a discussion
on solr-dev is probably warranted to figure out the best way to do this.

-Grant


Re: Color search

steve_rowe
In reply to this post by Guangwei Yuan
Hi Guangwei,

When you index your products, you could have a single color field, and
include duplicates of each color component proportional to its weight.

For example, if you decide to use 10% increments, for your black dress
with 70% of black, 20% of gray, 10% of brown, you would index the
following terms for the color field:

  black black black black black black black
  gray gray
  brown

This works because Lucene natively interprets document term frequencies
as weights.
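
A minimal sketch of how the repeated terms could be generated at indexing time, assuming the weights arrive as whole percentages and are rounded to the nearest increment. The class and method names are hypothetical.

```java
import java.util.Map;

// Hypothetical helper: repeats each color name in proportion to its weight.
final class WeightedColorTerms {
    /** Emits one copy of a color per incrementPercent of weight, separated by spaces. */
    static String toFieldValue(Map<String, Integer> percentByColor, int incrementPercent) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : percentByColor.entrySet()) {
            // Round to the nearest whole number of increments.
            int copies = Math.round(e.getValue() / (float) incrementPercent);
            for (int i = 0; i < copies; i++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(e.getKey());
            }
        }
        return sb.toString();
    }
}
```

For the black dress (black 70, gray 20, brown 10) and an increment of 10, this produces seven copies of "black", two of "gray" and one of "brown". Use an ordered map such as LinkedHashMap if the order of the terms should be stable.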

Steve


RE: Color search

Renaud Waldura-5
Here's another idea: encode color mixes as one RGB value (32 bits) and sort
according to those values. To find the closest color is like finding the
closest points in the color space. It would be like a distance search.

70% black #000000 = #000000
20% gray  #f0f0f0 = #303030
10% brown #8b4513 = #0e0702
composite         = #3e3732

The distance would be:
sqrt( (r1 - r0)^2 + (g1 - g0)^2 + (b1 - b0)^2 )

Where r0g0b0 is the color the user asked for, and r1g1b1 is the composite
color of the item, calculated above.
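
A minimal sketch of the composite and distance calculation, in plain Java. It assumes colors are packed as 0xRRGGBB integers and weights are fractions between 0 and 1; the class and method names are hypothetical.

```java
// Hypothetical helper: weighted composite of RGB colors and Euclidean distance.
final class CompositeColor {
    /** Returns {r, g, b} of the weighted sum of the given 0xRRGGBB colors. */
    static int[] composite(int[] rgbColors, double[] weights) {
        double r = 0, g = 0, b = 0;
        for (int i = 0; i < rgbColors.length; i++) {
            r += weights[i] * ((rgbColors[i] >> 16) & 0xff);
            g += weights[i] * ((rgbColors[i] >> 8) & 0xff);
            b += weights[i] * (rgbColors[i] & 0xff);
        }
        return new int[] {(int) Math.round(r), (int) Math.round(g), (int) Math.round(b)};
    }

    /** sqrt((r1 - r0)^2 + (g1 - g0)^2 + (b1 - b0)^2) for two {r, g, b} triples. */
    static double distance(int[] a, int[] b) {
        double dr = a[0] - b[0], dg = a[1] - b[1], db = a[2] - b[2];
        return Math.sqrt(dr * dr + dg * dg + db * db);
    }
}
```

For the dress above, composite(new int[] {0x000000, 0xf0f0f0, 0x8b4513}, new double[] {0.7, 0.2, 0.1}) gives {0x3e, 0x37, 0x32}, matching the hand calculation.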

--Renaud



Re: Color search

steve_rowe
Hi Renaud,

I think your method will produce strange results, probably in most
cases, e.g.

33% red #FF0000 = #550000
33% green #00FF00 = #005500
33% blue #0000FF = #000055
= #555555

Thus, a red, green and blue dress would score well against a search for
"medium gray".  Not good.

Steve


Re: Color search

Matthew Runo
In reply to this post by Guangwei Yuan
This discussion is incredibly interesting to me! We solved this  
simply by indexing the color names, and faceting on that. Not a very  
elegant solution, to be sure - but it works. If people search for a  
"green running shoe" they get -green- running shoes.

I would be very very interested in having a color picker ajax app  
which then went out and found the products with colors most like the  
one you chose.

Matthew Runo
Zappos Development



Re: Color search

hossman
In reply to this post by Guangwei Yuan

: useful to search products by color. A product image can have up to 5 colors
: (from a color space of about 100 colors), so we can implement it easily with
: Solr's facet search (thanks all who've developed Solr).
:
: The problem arises when we try to sort the results by the color relevancy.
: What's different from a normal facet search is that colors are weighted. For
: example, a black dress can have 70% of black, 20% of gray, 10% of brown. A

if 5 is a hard max on the number of colors that you support, then you can
always use 5 separate fields to store the colors in order of "dominance"
and then query on those 5 fields with varying boosts...

 color_1:black^10 color_2:black^7 color_3:black^4 color_4:black color_5:black^0.1

...something like this will lose the % granularity info that you have (so
a 60% black skirt and an 80% black dress would both score the same against
black since it's the dominant color)
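
A minimal sketch of building that query string for an arbitrary color, using the field names and boosts from the example above. The class and method names are hypothetical, and the color is assumed to be a single term with no query-syntax characters that would need escaping.

```java
// Hypothetical helper: builds the dominance-ordered boost query for one color.
final class DominanceQuery {
    // Boosts copied from the example; color_4 carries no explicit boost.
    private static final String[] BOOSTS = {"^10", "^7", "^4", "", "^0.1"};

    static String forColor(String color) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < BOOSTS.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append("color_").append(i + 1).append(':').append(color).append(BOOSTS[i]);
        }
        return sb.toString();
    }
}
```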

alternately: i'm assuming your percentage data only has so much confidence
-- maybe on the order of 10%?  you can have a separate field for each
"bucket" of color percentages and index the name of the color in the
corresponding bucket.  with 10% granularity that's only 10 fields -- a 10
clause boolean query for the color is no big deal ... even going to 5%
would be trivial.
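
A minimal sketch of the bucket idea under some assumptions that are not in the post: the fields are named c0 to c9, a percentage p from 1 to 100 falls in bucket (p - 1) / 10, and each query clause is boosted by its bucket number so that higher percentages rank higher. All names are hypothetical.

```java
// Hypothetical helper: maps percentages to bucket fields and builds the 10-clause query.
final class ColorBuckets {
    static final int BUCKETS = 10;
    static final int WIDTH = 100 / BUCKETS; // 10% per bucket

    /** Field that a color with the given percentage (1-100) is indexed into. */
    static String fieldFor(int percent) {
        if (percent < 1 || percent > 100) {
            throw new IllegalArgumentException("percentage out of range: " + percent);
        }
        return "c" + ((percent - 1) / WIDTH);
    }

    /** One clause per bucket; the boost grows with the bucket (an assumed scheme). */
    static String queryFor(String color) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < BUCKETS; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append('c').append(i).append(':').append(color).append('^').append(i + 1);
        }
        return sb.toString();
    }
}
```

Under these assumptions a 70% black dress is indexed as c6:black, and a search for black matches it through the c6 clause with a boost of 7.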


Incidentally: people interested in the general topic of color faceting at
a finer granularity than just color names may want to check out this
thread from last...

http://www.nabble.com/faceting-and-categorizing-on-color--tf1801106.html



-Hoss


Re: Color search

Mike Klaas
In reply to this post by Grant Ingersoll-2

On 28-Sep-07, at 6:31 AM, Grant Ingersoll wrote:

> Another option would be to extend Solr (and donate back) to  
> incorporate Lucene's payload functionality, in which case you could  
> associate the percentile of the color as a payload and use the  
> BoostingTermQuery... :-)  If you're interested in this, a  
> discussion on solr-dev is probably warranted to figure out the best  
> way to do this.

For reference, here is a summary of the changes needed:

1. A payload analyzer. Here is an example that tokenizes strings of the
form <token>:<whatever>:<score> into <token> with payload <score>:

   /** Returns the next token in the stream, or null at EOS. */
   public final Token next() throws IOException {
     Token t = input.next();
     if (null == t)
       return null;

     String s = t.termText();
     if (s.indexOf(":") > -1) {
       String[] parts = s.split(":");
       assert parts.length == 3;
       String colour = parts[0];
       // The score is the third part; the middle part is ignored.
       int bits = Float.floatToIntBits(Float.parseFloat(parts[2]));
       // Serialize the float's bits as 4 bytes, least significant byte first.
       byte[] buf = new byte[4];
       for (int shift = 0, i = 0; shift < 32; shift += 8, i++) {
         buf[i] = (byte) ((bits >> shift) & 0xff);
       }
       // Replace the raw token with the bare colour term carrying the payload.
       Token gen = new Token(colour, t.startOffset(), t.endOffset());
       gen.setPayload(new Payload(buf));
       t = gen;
     }
     return t;
   }


2. A payload deserializer.  Add this method to your custom Similarity
class:

   /** Reassembles the float from the 4 bytes written by the analyzer (low byte first). */
   public float scorePayload(byte[] payload, int offset, int length) {
     assert length == 4;
     int accum = ((payload[0 + offset] & 0xff)) |
                 ((payload[1 + offset] & 0xff) << 8) |
                 ((payload[2 + offset] & 0xff) << 16) |
                 ((payload[3 + offset] & 0xff) << 24);
     return Float.intBitsToFloat(accum);
   }

3. Add a relevant query clause.  In a custom request handler, you
could have a parameter to add BoostingTermQueries:

   q = new BoostingTermQuery(new Term("colourPayload", colour));
   query.add(q, Occur.SHOULD);
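
The byte layout shared by steps 1 and 2 can be checked without Lucene. The sketch below is a standalone round trip of the same encoding (float bits written as 4 bytes, least significant byte first); the class and method names are hypothetical and not part of any Lucene API.

```java
// Hypothetical standalone codec mirroring the payload encoding in steps 1 and 2.
final class PayloadCodec {
    /** Writes the float's IEEE 754 bits as 4 bytes, least significant byte first. */
    static byte[] encode(float score) {
        int bits = Float.floatToIntBits(score);
        byte[] buf = new byte[4];
        for (int shift = 0, i = 0; shift < 32; shift += 8, i++) {
            buf[i] = (byte) ((bits >> shift) & 0xff);
        }
        return buf;
    }

    /** Reads 4 bytes starting at offset and reassembles the float. */
    static float decode(byte[] payload, int offset) {
        int accum = (payload[offset] & 0xff)
                | ((payload[offset + 1] & 0xff) << 8)
                | ((payload[offset + 2] & 0xff) << 16)
                | ((payload[offset + 3] & 0xff) << 24);
        return Float.intBitsToFloat(accum);
    }
}
```

Decoding what was just encoded returns the original float exactly, since only the bit pattern is copied.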

How to add this generically is an interesting question.  There are  
many possibilities, especially on the request handler and tokenizer  
side of things.  If there is a consensus on a sensible way of doing  
this, I could contribute the bits of code that I have.

HTH,
-Mike


Re: Color search

Guangwei Yuan
Thanks for all the replies. I think creating 10 fields, and filling one field
with a color's value for each 10% of that color, is a reasonable approach,
and easy to implement too. One problem, though, is that not all products have
colors adding up to 100% (for various reasons, including our color
extraction algorithm). So, for a product with 50% of #000000 and 20%
of #999999, I'll have to fill the remaining three fields with some dummy
values. Otherwise, Lucene seems to score it higher than products that also
have 50% of #000000 but more than 20% of some other colors. Since I also
need a way to exclude the dummy value when faceting, is there a neater
solution?

I'll certainly look at the payload functionality, which is new to me :)

- Guangwei

Re: Color search

hossman

: extraction algorithm, etc.) So, for a product with 50% of #000000, and 20%
: of #999999, I'll have to fill the remaining three fields with some dummy
: values. Otherwise, Lucene seems to score it higher than products that also
: have 50% of #000000, but more than 20% of some other colors. Since I also

that doesn't really make sense to me ... your input is colors to search
for, and you query each of those colors against every field right?  so if
i said i want grey and red dresses, you query for...

        +(c0:grey c1:grey c2:grey c3:grey c4:grey
          c5:grey c6:grey c7:grey c8:grey c9:grey)
        +(c0:red c1:red c2:red c3:red c4:red
          c5:red c6:red c7:red c8:red c9:red)

...right?  a document that doesn't have any value in c6, c7 or c8
shouldn't score higher than any other documents ... if anything it should
score lower because of the coord factor.

can you explain exactly how you are indexing the data and what your
query looks like?




-Hoss


Re: Color search

Guangwei Yuan
>
> can you you explain exactly how you are indexing the data and what your
> query looks like?
>

I used the same field name (color), not 10 different names (c0 - c9).

So the index fields look like (50% #000000, 20% #999999):
color: #000000
color: #000000
color: #000000
color: #000000
color: #000000
color: #999999
color: #999999

The query for black dresses will be:
color:#000000

Re: Color search

hossman

: I used the same field name (color), not 10 different names (c0 - c9).

ah .. got it.  then what you are probably seeing is because of length
normalization, if you use omitNorms="true" then it shouldn't matter.
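
A rough calculation shows why the product with fewer color terms wins when norms are enabled. The sketch assumes the formulas of Lucene's DefaultSimilarity, tf = sqrt(freq) and lengthNorm = 1 / sqrt(number of terms in the field), and ignores idf, the query norm and the lossy one-byte storage of norms. The class and method names are hypothetical.

```java
// Hypothetical helper: the per-document part of the score for a single-term query.
final class LengthNormDemo {
    /**
     * termFreq:    how many times the queried color occurs in the field
     * fieldLength: total number of color terms in the field
     * omitNorms:   true to skip length normalization
     */
    static double score(int termFreq, int fieldLength, boolean omitNorms) {
        double tf = Math.sqrt(termFreq);
        double lengthNorm = omitNorms ? 1.0 : 1.0 / Math.sqrt(fieldLength);
        return tf * lengthNorm;
    }
}
```

With norms, a product indexed with 5 black terms out of 7 gets sqrt(5) / sqrt(7), about 0.85, while one with 5 black terms out of 10 gets sqrt(5) / sqrt(10), about 0.71, so the product with unaccounted-for color ranks higher. With norms omitted both get sqrt(5).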

(i don't know why i suggested a separate field for each 10% block ... i'm
sure i had a good reason but i can't think of it now)


-Hoss