Interesting Grouping/Facet issue

Erie Data Systems
I'm on Solr 8.0.0. I have a HASHTAG string field that I am trying to facet
on to get the most popular hashtags (top 100) across many sources. (The SITE
field is also a string.)

/select?facet.field=hashtag&facet=on&rows=0&q=%2Bhashtag:*%20%2BDT:[" .
date('Y-m-d') . "T00:00:00Z+TO+" . date('Y-m-d')  .
"T23:59:59Z]&facet.limit=100&facet.mincount=1&facet.method=fc

It works, but not the way I feel it should. For example, if one site
has 1000 rows on today's date and they all have a HASHTAG in common, that
HASHTAG automatically rises to the top simply because one SITE has 1000
pages with the same HASHTAG.

Is there a way to get a more even distribution of top HASHTAGS for a
given date, i.e. facet by a grouping, distinct, or filter of some sort?
I'm more interested in knowing whether a HASHTAG is used frequently across
SITEs, not just by one.

Hope this makes sense... any recommendations welcomed.

Thank you in advance,
-Craig

Re: Interesting Grouping/Facet issue

Shawn Heisey-2
On 4/9/2019 7:03 AM, Erie Data Systems wrote:

> I'm on Solr 8.0.0. I have a HASHTAG string field that I am trying to facet
> on to get the most popular hashtags (top 100) across many sources. (The SITE
> field is also a string.)
>
> /select?facet.field=hashtag&facet=on&rows=0&q=%2Bhashtag:*%20%2BDT:[" .
> date('Y-m-d') . "T00:00:00Z+TO+" . date('Y-m-d')  .
> "T23:59:59Z]&facet.limit=100&facet.mincount=1&facet.method=fc
>
> It works, but not the way I feel it should. For example, if one site
> has 1000 rows on today's date and they all have a HASHTAG in common, that
> HASHTAG automatically rises to the top simply because one SITE has 1000
> pages with the same HASHTAG.

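That is exactly what faceting is designed to do.  It counts matching
documents per value, so a hashtag on 1000 documents from one site will
outrank a hashtag that appears a few times on each of many sites.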

> Is there a way to get a more even distribution of top HASHTAGS for a
> given date, i.e. facet by a grouping, distinct, or filter of some sort?
> I'm more interested in knowing whether a HASHTAG is used frequently across
> SITEs, not just by one.

If you use pivot facets, first on the field you want to classify on,
then on HASHTAG, that MIGHT get you what you want.
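
To make the pivot idea concrete, here is a rough Python sketch, not a
tested recipe.  It assumes the site field is literally named "site" (the
thread only confirms "hashtag" and "DT"), and that the response uses the
usual facet_pivot layout of nested buckets.  The helper names are made up
for illustration.  The client-side step turns the SITE -> HASHTAG buckets
into a "how many sites use this hashtag" count.

```python
from collections import Counter

# Assumed field names: only "hashtag" appears in the original request;
# "site" is a guess at the real name of the SITE field.
SITE_FIELD = "site"
HASHTAG_FIELD = "hashtag"
PIVOT_KEY = f"{SITE_FIELD},{HASHTAG_FIELD}"


def pivot_params(date_range):
    """Request parameters for a SITE -> HASHTAG pivot facet."""
    return {
        "q": "*:*",
        "fq": [f"DT:{date_range}", f"{HASHTAG_FIELD}:[* TO *]"],
        "rows": 0,
        "facet": "on",
        "facet.pivot": PIVOT_KEY,
        "facet.mincount": 1,
        # A negative limit means "no limit", so no site or hashtag bucket
        # is cut off. This can be expensive on large indexes.
        "facet.limit": -1,
    }


def sites_per_hashtag(response):
    """Count how many distinct sites use each hashtag."""
    counts = Counter()
    for site_bucket in response["facet_counts"]["facet_pivot"][PIVOT_KEY]:
        # Each site bucket carries its hashtags in a nested "pivot" list.
        for tag_bucket in site_bucket.get("pivot", []):
            counts[tag_bucket["value"]] += 1
    return counts
```

Calling sites_per_hashtag(response).most_common(100) would then rank
hashtags by the number of sites using them rather than by document count.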

You could also try running many different facet queries, each one with a
specific query and/or filter that achieves the results you want.
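
One possible reading of the "many facet queries" approach, sketched in
Python: send one request per site, each restricted with a filter query,
then merge the per-site results on the client.  The "site" field name,
the per-site limit of 10, and both helper names are assumptions made for
the example.

```python
from collections import Counter


def per_site_params(sites, date_range, per_site_limit=10):
    """Build one parameter set per site, each restricted by a filter query."""
    requests = []
    for site in sites:
        # Escape backslashes and quotes so the value is a valid quoted phrase.
        escaped = site.replace("\\", "\\\\").replace('"', '\\"')
        requests.append({
            "q": "*:*",
            "fq": [f"DT:{date_range}", f'site:"{escaped}"'],
            "rows": 0,
            "facet": "on",
            "facet.field": "hashtag",
            "facet.limit": per_site_limit,
            "facet.mincount": 1,
        })
    return requests


def merge_site_facets(per_site_facets):
    """Tally, for each hashtag, the number of sites that returned it.

    per_site_facets maps a site name to the hashtags returned for that site.
    """
    tally = Counter()
    for hashtags in per_site_facets.values():
        tally.update(set(hashtags))  # each site counts at most once per tag
    return tally
```

Because every site contributes at most one count per hashtag, a single
site with 1000 pages can no longer dominate the merged ranking.  The cost
is one request per site, which may not scale to a large number of sites.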

FYI:  Including "hashtag:*" in your query makes it a wildcard query.
This is most likely VERY slow.  If you are trying to match all possible
values in the hashtag field, then take it out, it's unnecessary.  If you
are trying to match only documents where hashtag contains a value, then
replace it with this for a performance improvement:

hashtag:[* TO *]

Range queries are almost always faster than wildcards.
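
For illustration, the original request could be rebuilt along these lines
with the range form in place of the wildcard.  This is a sketch in Python
rather than the PHP used in the original post.  Moving both restrictions
into fq parameters and using q=*:* is a choice made for this example, not
something the advice above requires, and the function name is made up.

```python
from datetime import date
from urllib.parse import urlencode


def hashtag_facet_query(day):
    """Build the query string for the top-100 hashtag facet on one day."""
    start = f"{day.isoformat()}T00:00:00Z"
    end = f"{day.isoformat()}T23:59:59Z"
    params = [
        ("q", "*:*"),
        # Existence check written as a range query instead of hashtag:*
        ("fq", "hashtag:[* TO *]"),
        ("fq", f"DT:[{start} TO {end}]"),
        ("rows", 0),
        ("facet", "on"),
        ("facet.field", "hashtag"),
        ("facet.limit", 100),
        ("facet.mincount", 1),
        ("facet.method", "fc"),
    ]
    # urlencode takes care of escaping the brackets, colons, and spaces.
    return "/select?" + urlencode(params)
```

Passing date.today() reproduces the "today's date" range from the
original request.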

Thanks,
Shawn