Wikipedia or reuters like index for testing facets?

Wikipedia or reuters like index for testing facets?

Jason Rutherglen
Is there a standard index, like the one Lucene's contrib/benchmark uses, for
executing faceted queries against? Or maybe we can randomly generate one that
works in conjunction with wikipedia? That way we can execute real-world
queries against faceted data. Or we could use the Lucene/Solr mailing lists
and other data (à la Lucid's faceted site) as a standard index?

Re: Wikipedia or reuters like index for testing facets?

Mark Miller-3

I don't think there is any standard set of docs for Solr testing - there is
no real benchmark contrib - though I know more than a few of us have
hacked up pieces of Lucene benchmark to work with Solr. I think I've done
it twice now ;)

Would be nice to get things going. I was thinking the other day: I wonder
how hard it would be to make Lucene Benchmark generic enough to accept Solr
impls and Solr algs?

It does a lot that would suck to duplicate.

--
--
- Mark

http://www.lucidimagination.com

Re: Wikipedia or reuters like index for testing facets?

Grant Ingersoll-2
At a min, it is trivial to use the EnWikiDocMaker and then send the doc over SolrJ...


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search
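
The pipeline Grant describes boils down to streaming pages out of the enwiki dump and posting each one as a document. A minimal pure-JDK sketch of the extraction half (this is not the actual EnWikiDocMaker code; it just assumes the flat title/text layout of a pages-articles style dump):

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class WikiPageExtractor {

    /** Streams a pages-articles style dump, returning {title, text} pairs. */
    public static List<String[]> extract(Reader xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        List<String[]> pages = new ArrayList<>();
        String title = null, text = null;
        StringBuilder buf = new StringBuilder();
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    buf.setLength(0);                     // collect characters per element
                    break;
                case XMLStreamConstants.CHARACTERS:
                    buf.append(r.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    String name = r.getLocalName();
                    if ("title".equals(name)) title = buf.toString();
                    else if ("text".equals(name)) text = buf.toString();
                    else if ("page".equals(name)) pages.add(new String[] { title, text });
                    break;
            }
        }
        return pages;
    }
}
```

Each pair would then be mapped onto a SolrInputDocument and sent over SolrJ, which is the "trivial" half of the suggestion.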


Re: Wikipedia or reuters like index for testing facets?

Jason Rutherglen
You think enwiki has enough data for faceting?


Re: Wikipedia or reuters like index for testing facets?

Grant Ingersoll-2
Probably not as generated by the EnwikiDocMaker, but the WikipediaTokenizer in Lucene can pull out richer syntax, which could then be Teed/Sinked to other fields - things like categories, related links, etc. Mostly, though, I was just commenting on the fact that it isn't hard to at least use it for getting docs into Solr.

-Grant
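
For a rough idea of what pulling categories out of the markup involves, even without the WikipediaTokenizer, a regex over the raw wiki text covers the common cases (a simplified sketch only; the real tokenizer handles far more of the syntax):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryExtractor {
    // Matches [[Category:Name]] and [[Category:Name|sort key]] links in wiki markup.
    private static final Pattern CATEGORY =
        Pattern.compile("\\[\\[Category:([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");

    /** Returns the category names found in a page's wiki text, in order. */
    public static List<String> categories(String wikiText) {
        List<String> out = new ArrayList<>();
        Matcher m = CATEGORY.matcher(wikiText);
        while (m.find()) out.add(m.group(1).trim());
        return out;
    }
}
```

The extracted names would feed a dedicated facet field alongside the article body.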

Re: Wikipedia or reuters like index for testing facets?

Mark Miller-3
Why don't you just randomly generate the facet data? That's probably the best
way, right? You can control the uniques and ranges.


Re: Wikipedia or reuters like index for testing facets?

Jason Rutherglen
Yeah, that's what I was thinking of as an alternative: use enwiki
and randomly generate facet data along with it. However, for
consistent benchmarking the random data would need to stay the
same, so that people could execute the same benchmark
consistently in their own environment.
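
Pinning the seed gives exactly that reproducibility: java.util.Random is specified as a particular LCG, so the same seed yields the same sequence on any JVM. A minimal sketch (a hypothetical generator, not an existing benchmark tool) with the unique-value count controllable, per Mark's point:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class FacetDataGenerator {
    private final Random rnd;
    private final int uniqueCategories;

    // Same seed => same sequence everywhere: java.util.Random's LCG is
    // defined in the class spec, not left to the JVM implementation.
    public FacetDataGenerator(long seed, int uniqueCategories) {
        this.rnd = new Random(seed);
        this.uniqueCategories = uniqueCategories;
    }

    /** One synthetic facet value per call, e.g. to pair with each enwiki doc. */
    public String nextCategory() {
        return "cat_" + rnd.nextInt(uniqueCategories);
    }

    public List<String> generate(int docs) {
        List<String> out = new ArrayList<>(docs);
        for (int i = 0; i < docs; i++) out.add(nextCategory());
        return out;
    }
}
```

Publishing just the seed and the category count would let anyone rebuild the identical faceted index locally.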


Re: Wikipedia or reuters like index for testing facets?

Peter Wolanin-2
AWS provides some standard data sets, including an extract of all
wikipedia content:

http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2345&categoryID=249

Looks like it's not being updated often, so this or another AWS data
set could be a consistent basis for benchmarking?

-Peter




--
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
[hidden email]

Re: Wikipedia or reuters like index for testing facets?

Jason Rutherglen
The question that comes to mind is how it's different from
http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2

Guess we'd need to download it and take a look!


Re: Wikipedia or reuters like index for testing facets?

Grant Ingersoll-2
It's likely quite different. That link is meant to be stable for benchmarking purposes within Lucene.

Note, one thing I wish I had time for: hook Tee/Sink capabilities into Solr such that one could use the WikipediaTokenizer and then Tee the categories, etc. off to separate fields automatically for faceting.

-Grant


Re: Wikipedia or reuters like index for testing facets?

Jason Rutherglen
I saw the discussion about TeeSinkTokenFilter on java-user, and
was wondering how Solr performs copy fields. Couldn't Solr by
default utilize a TeeSinkTokenFilter-like class for copying
fields?

> That link is meant to be stable for benchmarking purposes within Lucene.

The fields are different?


Re: Wikipedia or reuters like index for testing facets?

Grant Ingersoll-2
It's only really effective if the number of tokens in the Sink is expected to be significantly smaller than the main stream (my various tests showed around < 50%, but YMMV), so it isn't likely useful for most copy-field situations. For Solr to utilize it, the schema would have to allow giving ids to the various TokenFilters so that you could identify the Tees and the Sinks. At least that was my first thought on it.

-Grant
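
Stripped of the Lucene API, the Tee/Sink idea is just a stream that passes every token through while copying the matching ones aside for another field. A toy sketch of that shape (not the actual TeeSinkTokenFilter interface; the divert predicate stands in for whatever marks category tokens):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

/** Conceptual tee: tokens flow through unchanged; matching ones are also copied to a sink. */
public class TeeTokens implements Iterator<String> {
    private final Iterator<String> source;
    private final Predicate<String> divert;
    private final List<String> sink = new ArrayList<>();

    public TeeTokens(Iterator<String> source, Predicate<String> divert) {
        this.source = source;
        this.divert = divert;
    }

    @Override public boolean hasNext() { return source.hasNext(); }

    @Override public String next() {
        String tok = source.next();
        if (divert.test(tok)) sink.add(tok); // teed copy for the facet field
        return tok;                          // main stream is unaffected
    }

    /** Tokens destined for the secondary (e.g. category) field. */
    public List<String> sink() { return sink; }
}
```

The single analysis pass is the whole appeal, and also why it only pays off when the sink is small relative to the main stream, as Grant notes.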
>>>>>>> Sinked
>>>>>>> to
>>>>>>> other fields.  Things like categories, related links, etc.  
>>>>>>> Mostly,
>>>>>>> though,
>>>>>>> I was just commenting on the fact that it isn't hard to at  
>>>>>>> least use
>>>>>>> it for
>>>>>>> getting docs into Solr.
>>>>>>>
>>>>>>> -Grant
>>>>>>>
>>>>>>> On Jul 14, 2009, at 7:38 PM, Jason Rutherglen wrote:
>>>>>>>
>>>>>>>  You think enwiki has enough data for faceting?
>>>>>>>>
>>>>>>>> On Tue, Jul 14, 2009 at 2:56 PM, Grant Ingersoll<[hidden email]
>>>>>>>> >
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> At a min, it is trivial to use the EnWikiDocMaker and then  
>>>>>>>>> send the
>>>>>>>>> doc
>>>>>>>>> over
>>>>>>>>> SolrJ...
>>>>>>>>>
>>>>>>>>> On Jul 14, 2009, at 4:07 PM, Mark Miller wrote:
>>>>>>>>>
>>>>>>>>>  On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen <
>>>>>>>>>>
>>>>>>>>>> [hidden email]> wrote:
>>>>>>>>>>
>>>>>>>>>>  Is there a standard index like what Lucene uses for
>>>>>>>>>> contrib/benchmark
>>>>>>>>>>>
>>>>>>>>>>> for
>>>>>>>>>>> executing faceted queries over? Or maybe we can randomly  
>>>>>>>>>>> generate
>>>>>>>>>>> one
>>>>>>>>>>> that
>>>>>>>>>>> works in conjunction with wikipedia? That way we can  
>>>>>>>>>>> execute real
>>>>>>>>>>> world
>>>>>>>>>>> queries against faceted data. Or we could use the Lucene/
>>>>>>>>>>> Solr
>>>>>>>>>>> mailing
>>>>>>>>>>> lists
>>>>>>>>>>> and other data (ala Lucid's faceted site) as a standard  
>>>>>>>>>>> index?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> I don't think there is any standard set of docs for solr  
>>>>>>>>>> testing -
>>>>>>>>>> there
>>>>>>>>>> is
>>>>>>>>>> not a real benchmark contrib - though I know more than a  
>>>>>>>>>> few of us
>>>>>>>>>> have
>>>>>>>>>> hacked up pieces of Lucene benchmark to work with Solr - I  
>>>>>>>>>> think
>>>>>>>>>> I've
>>>>>>>>>> done
>>>>>>>>>> it twice now ;)
>>>>>>>>>>
>>>>>>>>>> Would be nice to get things going. I was thinking the other  
>>>>>>>>>> day: I
>>>>>>>>>> wonder
>>>>>>>>>> how hard it would be to make Lucene Benchmark generic  
>>>>>>>>>> enough to
>>>>>>>>>> accept
>>>>>>>>>> Solr
>>>>>>>>>> impls and Solr algs?
>>>>>>>>>>
>>>>>>>>>> It does a lot that would suck to duplicate.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> --
>>>>>>>>>> - Mark
>>>>>>>>>>
>>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --------------------------
>>>>>>>>> Grant Ingersoll
>>>>>>>>> http://www.lucidimagination.com/
>>>>>>>>>
>>>>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/
>>>>>>>>> Droids)
>>>>>>>>> using
>>>>>>>>> Solr/Lucene:
>>>>>>>>> http://www.lucidimagination.com/search
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>> --------------------------
>>>>>>> Grant Ingersoll
>>>>>>> http://www.lucidimagination.com/
>>>>>>>
>>>>>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/
>>>>>>> Droids)
>>>>>>> using
>>>>>>> Solr/Lucene:
>>>>>>> http://www.lucidimagination.com/search
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> --
>>>>>> - Mark
>>>>>>
>>>>>> http://www.lucidimagination.com
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Peter M. Wolanin, Ph.D.
>>>> Momentum Specialist,  Acquia. Inc.
>>>> [hidden email]
>>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>> using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

Reply | Threaded
Open this post in threaded view
|

Re: Wikipedia or reuters like index for testing facets?

Alexandre Rafalovitch
In reply to this post by Jason Rutherglen
I have something that maybe could be made into one: http://uncorpora.org/

It is resolutions of the United Nations General Assembly in 6 official
languages aligned on a paragraph level in an XML (Translation Memory
eXchange) format. The 6 languages are: English, French, Spanish,
Arabic, Chinese, Russian.

Facets could be derived from already encoded information for:
1) Session number: 55-62
2) Committee number: 0-6
3) Operative/preambulatory phrase (for some of the paragraphs)
4) Resolution number (which is part of the record ID)
5) Cross-reference information that is embedded in the text, but is
marked off with XML tags

Markup and all, it is about 170 Mbytes between 6 languages.

If that looks useful, I would be happy to work with more experienced
Solr users to beat it into the right shape.

Regards,
    Alex.

Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/
- I think age is a very high price to pay for maturity (Tom Stoppard)

On Tue, Jul 14, 2009 at 3:36 PM, Jason
Rutherglen<[hidden email]> wrote:
> Is there a standard index like what Lucene uses for contrib/benchmark for
> executing faceted queries over? Or maybe we can randomly generate one that
> works in conjunction with wikipedia? That way we can execute real world
> queries against faceted data. Or we could use the Lucene/Solr mailing lists
> and other data (ala Lucid's faceted site) as a standard index?