Group by in Lucene ?

Group by in Lucene ?

Marcus Herou
Hi.

I have a situation where I'm searching amongst some 100K feeds and only want
one result per site in return. I have developed a really simple method of
grouping which just scrolls through the result set (hit set) until it has
populated maxNum docs, each from a unique site. Since I don't want to
reinvent the wheel, I want to know if Lucene has something like this built
in. I will also be using Solr soon, and then my own home-cooked recipe will
not work, so I really need a standard way of doing this.

I know Nutch has something like it called dedupField, which by default is
set to 2.

Anyone?


Kindly

//Marcus

--
Marcus Herou Solution Architect & Core Java developer Tailsweep AB
+46702561312
[hidden email]
http://www.tailsweep.com
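
A minimal sketch of the post-search de-duplication described above, assuming
the Lucene 2.x Hits API and a stored "site" field (class, field and variable
names are illustrative, not from the original post):

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class OnePerSiteSearch
{
    // Scroll through the hit set and keep at most one document per unique
    // site, stopping once maxNum documents have been collected.
    public static List<Document> searchOnePerSite(IndexSearcher searcher,
            Query query, String siteField, int maxNum) throws IOException
    {
        Hits hits = searcher.search(query);
        Set<String> seenSites = new HashSet<String>();
        List<Document> result = new ArrayList<Document>();
        for (int i = 0; i < hits.length() && result.size() < maxNum; i++)
        {
            Document doc = hits.doc(i);
            String site = doc.get(siteField);
            if (site != null && seenSites.add(site))
            {
                result.add(doc);
            }
        }
        return result;
    }
}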

Re: Group by in Lucene ?

Grant Ingersoll-2
Solr has an issue outstanding right now that implements something that  
may be close to what you want.  They are calling it Field Collapsing.  
See https://issues.apache.org/jira/browse/SOLR-236

-Grant


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!  http://www.apachecon.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





Re: Group by in Lucene ?

Marcus Herou
Thanks. They seem to have gotten quite far in the dev cycle on this. Seems
like it will land in Solr 1.3.

However, I would really like this feature to be developed for core Lucene.
How do I start that process?
Develop it yourself, you would say :) Seriously, isn't it a really cool and
useful feature?

Kindly

//Marcus


--
Marcus Herou Solution Architect & Core Java developer Tailsweep AB
+46702561312
[hidden email]
http://www.tailsweep.com

Re: Group by in Lucene ?

Grant Ingersoll-2


We're always open to well-thought-out and tested patches. See the Wiki for
info on contributing.

-Grant


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!  http://www.apachecon.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





Re: Group by in Lucene ?

Marcus Herou
Cool.

I'll do it, since this is an area I can spend time on.

Kindly

//Marcus

--
Marcus Herou Solution Architect & Core Java developer Tailsweep AB
+46702561312
[hidden email]
http://www.tailsweep.com

Re: Group by in Lucene ?

ninaS
Hey Marcus,

Have you already implemented this feature?
I'm looking for a group-by function for Lucene, too.

More precisely, I need it in Compass, which is built on top of Lucene.

I was thinking about using a HitCollector to get only one result per group.

How did you do it?

Cheers,
Nina



Re: Group by in Lucene ?

Marcus Herou
Hi.

I did partly solve this with Solr faceting, but it does not cover this quite
common DB feature:

num_en_entries = select count(distinct id) from BlogEntry where language='en'
num_sv_entries = select count(distinct id) from BlogEntry where language='sv'

It does, however, cover:

select count(id), date from BlogEntry group by date

I now need this feature elsewhere when parsing access logs etc., so I am
looking into MonetDB, LucidDB and FastBit. Sphinx Search seems to have
something like this:
http://www.sphinxsearch.com/docs/current.html#clustering

Did you ever try a HitCollector ?

//Marcus
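
For reference, the group-by-date case above maps to a plain Solr facet
request roughly like this (host and field name are illustrative):

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=date

which returns one count per distinct value of the date field - i.e. the
count(id) ... group by date case, but not the count(distinct id) one.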



--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[hidden email]
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Group by in Lucene ?

ninaS
Hello,

Yes, I tried a HitCollector, but I am not satisfied with it because you cannot use sorting with a HitCollector unless you implement a way to use TopFieldDocCollector. I did not manage to do that in a performant way.

It is easier to first do a normal search and "group by" afterwards:

Iterate through the result documents and take one from each group. Each document has a groupingKey; I remember which groupingKeys have already been used and don't add another document from that group to the result list.

Regards,
Nina

Re: Group by in Lucene ?

ninaS
By the way: if you only need to count documents (count groups), a HitCollector is a good choice. If you are only counting, you don't need to sort anything.


Re: Group by in Lucene ?

Marcus Herou
Hi.

I think this is way too slow, since what you are describing is something I
have already tested. However, I might be using the HitCollector badly.

Please prove me wrong. I'm supplying some code which I tested this with.
It stores a hash of the value of the term in a TIntHashSet and just
calculates the size of that set.
This one takes approx. 3 sec on about 0.5M rows, which is way too slow.


main test class:
public class GroupingTest
{
    protected static final Log log =
LogFactory.getLog(GroupingTest.class.getName());
    static DateFormat df = new SimpleDateFormat("yyyy-MM-dd");
    public static void main(String[] args)
    {
        Utils.initLogger();
        String[] fields =
{"uid","ip","date","siteId","visits","countryCode"};
        try
        {
            IndexFactory fact = new IndexFactory();
            String d = "/tmp/csvtest";
            fact.initDir(d);
            IndexReader reader = fact.getReader(d);
            IndexSearcher searcher = fact.getSearcher(d, reader);
            QueryParser parser = new MultiFieldQueryParser(fields,
fact.getAnalyzer());
            Query q = parser.parse("date:20090125");


            GroupingHitCollector coll = new GroupingHitCollector();
            coll.setDistinct(true);
            coll.setGroupField("uid");
            coll.setIndexReader(reader);
            long start = System.currentTimeMillis();
            searcher.search(q, coll);
            long stop = System.currentTimeMillis();
            System.out.println("Time: " + (stop - start) + ", distinct count(uid): "
                    + coll.getDistinctCount() + ", count(uid): " + coll.getCount());
        }
        catch (Exception e)
        {
            log.error(e.toString(), e);
        }
    }
}


public class GroupingHitCollector  extends HitCollector
{
    protected IndexReader indexReader;
    protected String groupField;
    protected boolean distinct;
    //protected TLongHashSet set;
    protected TIntHashSet set;
    protected int distinctSize;

    int count = 0;
    int sum = 0;

    public GroupingHitCollector()
    {
        set = new TIntHashSet();
    }

    public String getGroupField()
    {
        return groupField;
    }

    public void setGroupField(String groupField)
    {
        this.groupField = groupField;
    }

    public IndexReader getIndexReader()
    {
        return indexReader;
    }

    public void setIndexReader(IndexReader indexReader)
    {
        this.indexReader = indexReader;
    }

    public boolean isDistinct()
    {
        return distinct;
    }

    public void setDistinct(boolean distinct)
    {
        this.distinct = distinct;
    }

    public void collect(int doc, float score)
    {
        if(distinct)
        {
            try
            {
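                // loads the full stored document for every collected hit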
                Document document = this.indexReader.document(doc);
                if(document != null)
                {
                    String s = document.get(groupField);
                    if(s != null)
                    {
                        set.add(s.hashCode());
                        //set.add(Crc64.generate(s));
                    }
                }
            }
            catch (IOException e)
            {
                e.printStackTrace();
            }
        }
        count++;
        sum += doc;  // use it to avoid any possibility of being optimized away
    }

    public int getCount() { return count; }
    public int getSum() { return sum; }

    public int getDistinctCount()
    {
        distinctSize = set.size();
        return distinctSize;
    }
}




--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[hidden email]
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Group by in Lucene ?

Marcus Herou
Oh, by the way: faceting is easy; it's the distinct part that I think is hard.

Example Lucene Facet:
http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html



--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[hidden email]
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Group by in Lucene ?

Erick Erickson
In reply to this post by Marcus Herou
At a quick glance, this line is really suspicious:

Document document = this.indexReader.document(doc)

From the Javadoc for HitCollector.collect:

Note: This is called in an inner search loop. For good search performance,
implementations of this method should not call Searcher.doc(int) or
IndexReader.document(int) on every document number encountered. Doing so
can slow searches by an order of magnitude or more.

You're loading the document each time through the loop. I think you'd get
much better
performance by making sure that your groupField is indexed, then use
TermDocs (TermEnum?)
to get the value of the field.

Best
Erick
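
A sketch of that idea, assuming Lucene 2.x and an indexed, single-valued
group field: walk TermEnum/TermDocs once per reader to build a docId ->
group-ordinal array, so the collector never loads a document (the class name
is illustrative):

import java.io.IOException;
import java.util.Arrays;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.HitCollector;

public class OrdinalGroupingCollector extends HitCollector
{
    private final int[] ords;              // docId -> group ordinal, -1 = no value
    private final BitSet seen = new BitSet();

    public OrdinalGroupingCollector(IndexReader reader, String groupField)
        throws IOException
    {
        ords = new int[reader.maxDoc()];
        Arrays.fill(ords, -1);
        int ord = 0;
        TermEnum terms = reader.terms(new Term(groupField, ""));
        TermDocs termDocs = reader.termDocs();
        try
        {
            do
            {
                Term t = terms.term();
                if (t == null || !t.field().equals(groupField)) break;
                termDocs.seek(terms);
                while (termDocs.next())
                {
                    ords[termDocs.doc()] = ord;  // one pass over the postings per term
                }
                ord++;
            }
            while (terms.next());
        }
        finally
        {
            termDocs.close();
            terms.close();
        }
    }

    public void collect(int doc, float score)
    {
        int ord = ords[doc];
        if (ord >= 0) seen.set(ord);  // no document load in the inner search loop
    }

    public int getDistinctCount()
    {
        return seen.cardinality();
    }
}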




Re: Group by in Lucene ?

Mark Miller-3
Group-by in Lucene/Solr has not been solved in a great general way yet
to my knowledge.

Ideally, we would want a solution that does not need to fit into memory.
However, you need the value of the field for each document to do the
grouping, and as you are finding, this is not cheap to get. Currently, the
efficient way to get it is to use a FieldCache. This, however, requires
that every distinct value can fit into memory.

Once you have efficient access to the values, you need to be able to
efficiently group the results, again not bounded by memory (which we
already are with the FieldCache).

There are quite a few ways to do this. The simplest is to group until
you have used all the memory you want; then, for everything left, if a
document doesn't match an existing group, write it to an overflow file,
and if it does, increment that group's count. Use the overflow file as
the input in the next run, and repeat until there is no overflow. You
can improve on that by partitioning the overflow file.

And then there are a dozen other methods.

Solr has a patch in JIRA that uses a sorting method. First the results
are sorted on the group-by field, then scanned through for grouping -
all field values that are the same will be next to each other. Finally,
if you really wanted to sort on a different field, another sort is
applied. That's not ideal IMO, but it's a start.

- Mark
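
For the in-memory route Mark describes, a FieldCache-based distinct-count
collector could look roughly like this (a sketch, assuming Lucene 2.x and a
single-valued indexed field; the class name is illustrative):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;

public class FieldCacheDistinctCollector extends HitCollector
{
    private final FieldCache.StringIndex index;  // order[doc] -> ordinal, lookup[ordinal] -> value
    private final BitSet seen = new BitSet();

    public FieldCacheDistinctCollector(IndexReader reader, String groupField)
        throws IOException
    {
        // Built once per reader and cached by Lucene; requires all distinct
        // values of the field to fit into memory.
        index = FieldCache.DEFAULT.getStringIndex(reader, groupField);
    }

    public void collect(int doc, float score)
    {
        seen.set(index.order[doc]);
    }

    public int getDistinctCount()
    {
        // ordinal 0 corresponds to documents with no value in the field
        return seen.get(0) ? seen.cardinality() - 1 : seen.cardinality();
    }
}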



Re: Group by in Lucene ?

Marcus Herou
In reply to this post by Erick Erickson
Yep, you are correct; this is a lousy implementation, which I knew when I
wrote it.

I'm not interested in the entire document, just the grouping term and the
docId it is connected to.

So how do I get hold of the TermDocs for the grouping field?

I mean, I probably first need to perform the query, searcher.search(...),
which would give me a set of doc ids. Then I need to group them all by,
for instance, "ip-address", save each ip-address in another set, and in
the end calculate the size of that set.

i.e. the equivalent of: select count(distinct ipAddress) from AccessLog
where date='2009-01-25' (optionally group by ipAddress?)


//Marcus



--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[hidden email]
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Group by in Lucene ?

Marcus Herou
In reply to this post by Mark Miller-3
Yep. Probably an external sort should be used when flushing to disk. I have
written such code, so that is probably a no-brainer; the problem is to get
it speedy :)
http://dev.tailsweep.com/projects/utils/apidocs/com/tailsweep/utils/sort/TupleSorter.html

Another way could be to use HDFS and MapFiles/SequenceFiles. Not speedy at
all, but scalable.

I'm thinking of writing my own inverted index, specialized for these kinds
of operations. Any pointers on where to start looking for material on that?

/Marcus


--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[hidden email]
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: Group by in Lucene ?

mschipperheyn
http://code.google.com/p/bobo-browse

looks like it may be the ticket.

Marc

Re: Group by in Lucene ?

Erik Hatcher
Don't overlook Solr: http://lucene.apache.org/solr

        Erik


