Extracting top level URL when indexing document

Extracting top level URL when indexing document

Hanjan, Harinder
Hello!

I am indexing web documents and need to extract their top-level URL into a separate field. I have had some success with the PatternTokenizerFactory (relevant schema bits at the bottom), but the behavior appears to be inconsistent. Most of the time the top-level URL is extracted just fine, but for some documents it is cut off.

Examples:
URL                                               | Extracted URL                     | Comment
http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf | http://www.calgaryarb.ca          | Success
http://www.calgarymlc.ca/about-cmlc/              | http://www.calgarymlc.ca          | Success
http://www.calgarypolicecommission.ca/reports.php | http://www.calgarypolicecommissio | Fail
https://attainyourhome.com/                       | https://attai                     | Fail
https://liveandplay.calgary.ca/DROPIN/page/dropin | https://livea                     | Fail

Relevant schema:
<copyField dest="hostname" source="SolrId"/>

<field name="hostname" type="hostnameType" stored="true" indexed="false" multiValued="false"/>

<fieldType name="hostnameType" class="solr.TextField" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)"
               group="0"/>
  </analyzer>
</fieldType>


I have tested the regex and it matches fine. Please see https://regex101.com/r/wN6cZ7/358.
So it appears that I have a gap in my understanding of how Solr's PatternTokenizerFactory works. I would appreciate any insight on the issue. The hostname field will be used in facet queries.

Thank you!
Harinder

________________________________
NOTICE -
This communication is intended ONLY for the use of the person or entity named above and may contain information that is confidential or legally privileged. If you are not the intended recipient named above or a person responsible for delivering messages or communications to the intended recipient, YOU ARE HEREBY NOTIFIED that any use, distribution, or copying of this communication or any of the information contained in it is strictly prohibited. If you have received this communication in error, please notify us immediately by telephone and then destroy or delete this communication, or return it to us by mail if requested by us. The City of Calgary thanks you for your attention and co-operation.

Re: Extracting top level URL when indexing document

Kevin Risden-3
Looks like stop words (in, and, on) is what is breaking. The regex looks
like it is correct.

Kevin Risden


Re: Extracting top level URL when indexing document

Alexandre Rafalovitch
In reply to this post by Hanjan, Harinder
Try URLClassifyProcessorFactory in the processing chain instead, configured
in solrconfig.xml

There is very little documentation for it, so check the source for exact
params. Or search for the blog post introducing it several years ago.

Documentation patches would be welcome.

Regards,
    Alex


RE: [EXT] Re: Extracting top level URL when indexing document

Hanjan, Harinder
Thank you, Alex. I have managed to get this to work via URLClassifyProcessorFactory. If anyone is interested, it can be done easily with the following solrconfig.xml:

<updateRequestProcessorChain name="urlProcessor">
  <processor class="org.apache.solr.update.processor.URLClassifyProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="inputField">SolrId</str>
    <str name="domainOutputField">hostname</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">urlProcessor</str>
  </lst>
</requestHandler>

I will look at how to submit a patch to the Javadoc.

Thanks!
Harinder
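
As a rough way to preview the values such a chain might write, the sketch below pulls the host out of each example URL from the first message. It assumes that `domainOutputField` receives the host part of the URL, which this thread does not confirm, so check the processor source before relying on it. The `host_of` helper is hypothetical and exists only for this illustration.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host part of a URL, or None if it has none.

    Assumption: this approximates what ends up in the hostname field.
    """
    return urlsplit(url).hostname


# Example URLs taken from the first message in this thread.
EXAMPLES = (
    "http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf",
    "http://www.calgarypolicecommission.ca/reports.php",
    "https://attainyourhome.com/",
    "https://liveandplay.calgary.ca/DROPIN/page/dropin",
)

for url in EXAMPLES:
    print(url, "->", host_of(url))
```

Note that a host obtained this way carries no scheme and keeps any `www.` prefix, so it is not identical to the output the original regex was aiming for.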


Re: Extracting top level URL when indexing document

Gus Heck
In reply to this post by Kevin Risden-3
I don't understand the inclusion of 'n' in the character classes in this
pattern... it's pretty clear that the broken examples in the OP were where
the letter n occurred in the domain name. I expect a similar problem for
user parts that contain n...

^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)

--
http://www.the111shift.com
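
To make the point about the letter n concrete, here is a minimal Python sketch that runs the pattern exactly as posted against the example URLs from the first message. The `INTENDED` pattern is only an assumption about what was meant (`\n` for a newline rather than `/n`, plus an escaped dot after `www`). Python's `re` module is used purely for illustration; Solr uses Java's regex engine, which handles these particular constructs the same way as far as I know.

```python
import re

# Pattern as posted in the schema: "/n" inside a character class excludes
# the two characters '/' and 'n', not a newline.
POSTED = re.compile(r"^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)")

# Assumed intent (not confirmed in the thread): exclude '/' and newline,
# and match a literal dot after "www".
INTENDED = re.compile(r"^https?://(?:[^@/\n]+@)?(?:www\.)?([^:/\n]+)")


def extract(pattern, url):
    """Return the whole match (group 0, as in the schema), or None."""
    match = pattern.match(url)
    return match.group(0) if match else None


EXAMPLES = (
    "http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf",
    "http://www.calgarypolicecommission.ca/reports.php",
    "https://attainyourhome.com/",
    "https://liveandplay.calgary.ca/DROPIN/page/dropin",
)

for url in EXAMPLES:
    print(url)
    print("  posted:  ", extract(POSTED, url))
    print("  intended:", extract(INTENDED, url))
```

With the posted pattern, each failing URL is cut just before the first `n` in its host name, which matches the truncated values reported in the first message.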

Faceting with a multi valued field

Hanjan, Harinder
In reply to this post by Kevin Risden-3
Hello!

I am faceting on a multi-valued field and it's yielding expected but undesirable results. I need different behaviour but am not sure how to formulate a query for it. Here is my current setup.

===== Data Set =====
  {
    "Communities":["BANFF TRAIL - BNF", "PARKDALE - PKD"],
    "Document Type":"Engagement - What We Heard Report",
    "Navigation":"Livelink",
    "SolrId":"http://thesimpsons.com/one"
  }
  {
    "Communities":["BANFF TRAIL - BNF", "PARKDALE - PKD"],
    "Document Type":"Engagement - What We Heard Report",
    "Navigation":"Livelink",
    "Id":"http://thesimpsons.com/two"
  }
  {
    "Communities":["SUNALTA - SNA"],
    "Document Type":"Engagement - What We Heard Report",
    "Navigation":"Livelink",
    "Id":"http://thesimpsons.com/three"
  }

===== Query I run now =====
http://localhost:8984/solr/everything/select?q=*:*&facet=on&facet.field=Communities&fq=Communities:"BANFF TRAIL - BNF"


===== Results I get now =====
{
  ...
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "Communities":[
        "BANFF TRAIL - BNF",2,
        "PARKDALE - PKD",2,
        "SUNALTA - SNA",0]},
   ...

Notice that the Communities facet has two non-zero counts. I understand this is because I'm using fq to get only documents which contain BANFF TRAIL, but those documents also contain PARKDALE.

Now, I am using facets to drive navigation on my page. The business case is that a user can select a community to get documents pertaining to that specific community only. This works with the query I have above. However, the facet results also contain other communities, which then get displayed to the user. For example, with the query above, the user will see both BANFF TRAIL and PARKDALE as selected values even though the user only selected BANFF TRAIL. It's worth noting that I have no control over the data being sent to Solr and can't change it.

How can I formulate a query to ensure that when the user selects BANFF TRAIL, only BANFF TRAIL is returned under the Solr facets?

Thanks!
Harinder
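
The counts in the message above can be reproduced with a small simulation. This is only a sketch of the counting logic, not of how Solr implements faceting: the filter selects whole documents, and every value on a selected document is then counted. The short `id` values and the `facet_counts` helper are hypothetical names introduced for the example.

```python
from collections import Counter

# The three documents from the data set above, reduced to the faceted field.
DOCS = [
    {"id": "one", "Communities": ["BANFF TRAIL - BNF", "PARKDALE - PKD"]},
    {"id": "two", "Communities": ["BANFF TRAIL - BNF", "PARKDALE - PKD"]},
    {"id": "three", "Communities": ["SUNALTA - SNA"]},
]


def facet_counts(docs, field, fq_value=None):
    """Count field values over the documents that pass the filter.

    Every value seen anywhere in the data is listed, so values that occur
    only on filtered-out documents appear with a count of 0.
    """
    all_values = sorted({value for doc in docs for value in doc[field]})
    matching = [doc for doc in docs if fq_value is None or fq_value in doc[field]]
    counts = Counter(value for doc in matching for value in doc[field])
    return {value: counts[value] for value in all_values}


print(facet_counts(DOCS, "Communities", fq_value="BANFF TRAIL - BNF"))
```

Both documents that pass the filter also carry PARKDALE, so it is counted twice as well, while SUNALTA drops to zero.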


Re: Faceting with a multi valued field

John Blythe-2
you can update your filter query to be a facet query, this will apply the
query to the resulting facet set instead of the Communities field itself.

--
John Blythe



Re: Faceting with a multi valued field

Alexandre Rafalovitch
What specifically do you control? Just keyword (and "Communities:"
part is locked?) or anything after q= or anything that allows multiple
variables?

Because if you could isolate search value, you could use for example
facet.prefix, set in solrconfig as a default parameter and populated
from the same variable as the Communities search.

You may also want to set facet.mincount=1 in solrconfig.xml to avoid
0-value facets in general:
https://lucene.apache.org/solr/guide/7_4/faceting.html

Regards,
   Alex.



RE: [EXT] Re: Faceting with a multi valued field

Hanjan, Harinder
In reply to this post by John Blythe-2
John,

I just want to make sure I understand correctly. Replace fq with facet.query?

So then the resultant query goes from:
q=*:*&facet=on&facet.field=Communities&fq=Communities:"BANFF TRAIL - BNF"

to:
q=*:*&facet=on&facet.field=Communities&facet.query="BANFF TRAIL - BNF"


If that's correct, then this does not resolve the issue. I still get two values under the Communities facet.

Harinder

-----Original Message-----
From: John Blythe [mailto:[hidden email]]
Sent: Tuesday, September 25, 2018 2:50 PM
To: [hidden email]
Subject: [EXT] Re: Faceting with a multi valued field

you can update your filter query to be a facet query, this will apply the query to the resulting facet set instead of the Communities field itself.

--
John Blythe


On Tue, Sep 25, 2018 at 4:15 PM Hanjan, Harinder <[hidden email]>
wrote:

> Hello!
>
> I am doing faceting on a field which has multiple values and it's
> yielding expected but undesireable results. I need different behaviour
> but not sure how to formulate a query for it. Here is my current setup.
>
> ===== Data Set =====
>   {
> "Communities":["BANFF TRAIL - BNF", "PARKDALE - PKD"], "Document
> Type":"Engagement - What We Heard Report", "Navigation":"Livelink",
> "SolrId":"https://urldefense.proofpoint.com/v2/url?u=http-3A__thesimpsons.com_one&d=DwIBaQ&c=jdm1Hby_BzoqwoYzPsUCHSCnNps9LuidNkyKDuvdq3M&r=N30IrhmaeKKhVHu13d-HO9gO9CysWnvGGoKrSNEuM3U&m=PX7TJqsA8tYgbN7HmkGd0GNzotXPc3hcoc9xRvmOiXI&s=GMTgF731T72VIryx_v7VD5f_oBlbrzXYAB1UEBQMOOc&e="
>   }
>   {
> "Communities":["BANFF TRAIL - BNF", "PARKDALE - PKD"], "Document
> Type":"Engagement - What We Heard Report", "Navigation":"Livelink",
> "Id":"https://urldefense.proofpoint.com/v2/url?u=http-3A__thesimpsons.com_two&d=DwIBaQ&c=jdm1Hby_BzoqwoYzPsUCHSCnNps9LuidNkyKDuvdq3M&r=N30IrhmaeKKhVHu13d-HO9gO9CysWnvGGoKrSNEuM3U&m=PX7TJqsA8tYgbN7HmkGd0GNzotXPc3hcoc9xRvmOiXI&s=FN6T49z8wjc_mRdXnVHgcdZBcZB6O_InSyUzxaxxiM0&e="
>   }
>   {
> "Communities":["SUNALTA - SNA"],
> "Document Type":"Engagement - What We Heard Report",
> "Navigation":"Livelink",
> "Id":"https://urldefense.proofpoint.com/v2/url?u=http-3A__thesimpsons.com_three&d=DwIBaQ&c=jdm1Hby_BzoqwoYzPsUCHSCnNps9LuidNkyKDuvdq3M&r=N30IrhmaeKKhVHu13d-HO9gO9CysWnvGGoKrSNEuM3U&m=PX7TJqsA8tYgbN7HmkGd0GNzotXPc3hcoc9xRvmOiXI&s=HEJFyAhHIn5T-riqVVMR011KXAn38lZUDyRQ-ljC-qA&e="
>   }
>
> ===== Query I run now =====
>
> https://urldefense.proofpoint.com/v2/url?u=http-3A__localhost-3A8984_s
> olr_everything_select-3Fq-3D-2A-3A-2A-26facet-3Don-26facet.field-3DCom
> munities-26fq-3DCommunities-3A-2522BANFF&d=DwIBaQ&c=jdm1Hby_BzoqwoYzPs
> UCHSCnNps9LuidNkyKDuvdq3M&r=N30IrhmaeKKhVHu13d-HO9gO9CysWnvGGoKrSNEuM3
> U&m=PX7TJqsA8tYgbN7HmkGd0GNzotXPc3hcoc9xRvmOiXI&s=Cx6EubqN_-ocYrZA6jsJ
> TGzodPqUPVu78eY1iMB_0L8&e=
> TRAIL - BNF"
>
>
> ===== Results I get now =====
> {
>   ...
>   "facet_counts":{
>     "facet_queries":{},
>     "facet_fields":{
>       "Communities":[
>         "BANFF TRAIL - BNF",2,
>         "PARKDALE - PKD",2,
>         "SUNALTA - SNA",0]},
>    ...
>
> Notice that the Communities facet has 2 non zero results. I understand
> this is because I'm using fq to get only documents which contain BANFF
> TRAIL but those documents also contain PARKDALE.
>
> Now, I am using facets to drive navigation on my page. The business
> case is that user can select a community to get documents pertaining
> to that specific community only. This works with the query I have
> above. However, the facets results also contain other communities
> which then get displayed to the user. For example, with the query
> above, user will see both BANFF TRAIL and PARKDALE as selected values
> even though user only selected BANFF TRAIL. It's worthwhile noting
> that I have no control over the data being sent to Solr and can't change it.
>
> How can I formulate a query to ensure that when user selects BANFF
> TRAIL, only BANFF TRAIL is returned under Solr facets?
>
> Thanks!
> Harinder
>

RE: [EXT] Re: Faceting with a multi valued field

Hanjan, Harinder
In reply to this post by Alexandre Rafalovitch
I control everything except the data that's being indexed. So I can manipulate the Solr query as needed.

I tried the facet.prefix option and initial testing shows promise.
q=*:*&facet=on&facet.field=Communities&f.Communities.facet.prefix=BANFF+TRAIL+-+BNF
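For anyone assembling this request in code, here is a minimal Python sketch that builds the same parameters. The helper name `facet_prefix_params` is hypothetical, and the field name and selected value are the ones from this thread; treat it as an illustration rather than a tested client.

```python
from urllib.parse import urlencode, parse_qs


def facet_prefix_params(selected, field="Communities"):
    """Return a query string for a facet.prefix request on one field.

    `selected` is the community the user picked; `field` is the facet
    field. Both are assumptions taken from the example in this thread.
    """
    params = [
        ("q", "*:*"),
        ("facet", "on"),
        ("facet.field", field),
        # Per-field override: only facet values starting with `selected`.
        (f"f.{field}.facet.prefix", selected),
    ]
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    return urlencode(params)


query_string = facet_prefix_params("BANFF TRAIL - BNF")
print(query_string)
```

Two caveats: facet.prefix is a prefix match, so any longer value that starts with the same text would also be returned, and without an fq the counts cover every document matched by q, not only the selected community.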

Thanks much!


-----Original Message-----
From: Alexandre Rafalovitch [mailto:[hidden email]]
Sent: Tuesday, September 25, 2018 3:14 PM
To: solr-user
Subject: [EXT] Re: Faceting with a multi valued field

What specifically do you control? Just keyword (and "Communities:"
part is locked?) or anything after q= or anything that allows multiple variables?

Because if you could isolate the search value, you could use, for example, facet.prefix, set in solrconfig.xml as a default parameter and populated from the same variable as the Communities search.

You may also want to set facet.mincount=1 in solrconfig.xml to avoid 0-value facets in general:
https://lucene.apache.org/solr/guide/7_4/faceting.html
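
To illustrate what facet.mincount=1 changes, here is a small Python sketch that applies the same cutoff on the client side to the flat value/count list Solr returns for a facet field. The helper names `pair_up` and `apply_mincount` are hypothetical; the sample list is the one from the original question.

```python
def pair_up(flat):
    """Turn Solr's flat [value, count, value, count, ...] list into pairs."""
    return list(zip(flat[0::2], flat[1::2]))


def apply_mincount(flat, mincount=1):
    """Keep only facet values whose count is at least `mincount`."""
    return [(value, count) for value, count in pair_up(flat) if count >= mincount]


# Facet list as shown in the "Results I get now" section of the question.
communities = ["BANFF TRAIL - BNF", 2, "PARKDALE - PKD", 2, "SUNALTA - SNA", 0]
print(apply_mincount(communities))
```

Note that this only drops the zero-count SUNALTA entry; PARKDALE still has a count of 2, so mincount alone does not hide it.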

Regards,
   Alex.


On 25 September 2018 at 16:50, John Blythe <[hidden email]> wrote:

> You can change your filter query into a facet query; this will apply
> the query to the resulting facet set instead of the Communities field itself.
>
> --
> John Blythe
>
>
> On Tue, Sep 25, 2018 at 4:15 PM Hanjan, Harinder
> <[hidden email]>
> wrote:
>
>> Hello!
>>
>> I am doing faceting on a field which has multiple values and it's
>> yielding expected but undesirable results. I need different
>> behaviour but not sure how to formulate a query for it. Here is my current setup.
>>
>> ===== Data Set =====
>>   {
>> "Communities":["BANFF TRAIL - BNF", "PARKDALE - PKD"],
>> "Document Type":"Engagement - What We Heard Report",
>> "Navigation":"Livelink",
>> "SolrId":"http://thesimpsons.com/one"
>>   }
>>   {
>> "Communities":["BANFF TRAIL - BNF", "PARKDALE - PKD"],
>> "Document Type":"Engagement - What We Heard Report",
>> "Navigation":"Livelink",
>> "Id":"http://thesimpsons.com/two"
>>   }
>>   {
>> "Communities":["SUNALTA - SNA"],
>> "Document Type":"Engagement - What We Heard Report",
>> "Navigation":"Livelink",
>> "Id":"http://thesimpsons.com/three"
>>   }
>>
>> ===== Query I run now =====
>>
>> http://localhost:8984/solr/everything/select?q=*:*&facet=on&facet.field=Communities&fq=Communities:"BANFF TRAIL - BNF"
>>
>>
>> ===== Results I get now =====
>> {
>>   ...
>>   "facet_counts":{
>>     "facet_queries":{},
>>     "facet_fields":{
>>       "Communities":[
>>         "BANFF TRAIL - BNF",2,
>>         "PARKDALE - PKD",2,
>>         "SUNALTA - SNA",0]},
>>    ...
>>
>> Notice that the Communities facet has 2 non zero results. I
>> understand this is because I'm using fq to get only documents which
>> contain BANFF TRAIL but those documents also contain PARKDALE.
>>
>> Now, I am using facets to drive navigation on my page. The business
>> case is that user can select a community to get documents pertaining
>> to that specific community only. This works with the query I have
>> above. However, the facets results also contain other communities
>> which then get displayed to the user. For example, with the query
>> above, user will see both BANFF TRAIL and PARKDALE as selected values
>> even though user only selected BANFF TRAIL. It's worthwhile noting
>> that I have no control over the data being sent to Solr and can't change it.
>>
>> How can I formulate a query to ensure that when user selects BANFF
>> TRAIL, only BANFF TRAIL is returned under Solr facets?
>>
>> Thanks!
>> Harinder
>>

Re: Faceting with a multi valued field

Shawn Heisey-2
In reply to this post by Hanjan, Harinder
On 9/25/2018 2:14 PM, Hanjan, Harinder wrote:
> Hello!

When starting a new topic on the mailing list, do not reply to an
existing message.  Your thread is buried within a thread originally
titled "Extracting top level URL when indexing document".

https://home.apache.org/~hossman/#threadhijack

> Notice that the Communities facet has 2 non zero results. I understand this is because I'm using fq to get only documents which contain BANFF TRAIL but those documents also contain PARKDALE.

Facets return information about what the documents that match the query
contain.  ALL of the information.  The query that returned those matches
is not examined at all when calculating facets, only the *results* of
the query are examined.  I don't think there's any way you can exclude
the information that you want to exclude, other than removing it from
the documents entirely.  I would imagine that the PARKDALE information
is required in those documents for other purposes and probably can't be
removed.
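
To make that concrete, here is a rough in-memory Python sketch of the counting behaviour described above. It is not how Solr is implemented; the document ids are simplified stand-ins, and `filter_docs` and `facet_counts` are hypothetical helper names.

```python
from collections import Counter

# Simplified copies of the three documents from the question.
DOCS = [
    {"Id": "one", "Communities": ["BANFF TRAIL - BNF", "PARKDALE - PKD"]},
    {"Id": "two", "Communities": ["BANFF TRAIL - BNF", "PARKDALE - PKD"]},
    {"Id": "three", "Communities": ["SUNALTA - SNA"]},
]


def filter_docs(docs, field, value):
    """Rough stand-in for fq: keep documents whose field contains value."""
    return [doc for doc in docs if value in doc.get(field, [])]


def facet_counts(all_docs, matched_docs, field):
    """Count every value of `field` found in the matched documents.

    Values that exist in the index but not in the matches get a count of
    zero, which mirrors the zero-count entry in the question's output.
    """
    counts = Counter({v: 0 for doc in all_docs for v in doc.get(field, [])})
    for doc in matched_docs:
        # All values of a matched document are counted, not just the one
        # that made the document match.
        counts.update(doc.get(field, []))
    return dict(counts)


matched = filter_docs(DOCS, "Communities", "BANFF TRAIL - BNF")
counts = facet_counts(DOCS, matched, "Communities")
print(counts)
```

Both matched documents also carry PARKDALE, so it is counted twice even though the filter only mentioned BANFF TRAIL.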

Thanks,
Shawn