Faceting

Faceting

José Moreira-3
Hello,

I'm planning to index a 'content' field for search, and from that
field's text I would like to facet (probably) according to whether
the content has e-mail addresses and URLs and, among the URLs, links
to pictures, videos and others.

As I'm a relatively new user of Solr, my plan was to run regexps over
the content in my application and add tags to a Solr field according
to what they find, so for example the content "[hidden email]
http://www.site.com" would have the tags "email, link".

If I follow this path, can I then facet on "email" and/or "link"? For
example, by combining facet field with facet value params?

Best

--
http://pt.linkedin.com/in/josemoreira
[hidden email]
http://djangopeople.net/josemoreira/
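
The application-side tagging described in this post could look roughly like the sketch below. It is only an illustration under stated assumptions: the regular expressions are deliberately loose, the extension lists and the tag names "link_image", "link_video" and "link_other" are made up for the example, and jose@example.com stands in for the hidden address.

```python
import re
from urllib.parse import urlparse

# Deliberately loose patterns, good enough for a sketch only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"']+")

# Hypothetical extension lists; extend them to suit your content.
IMAGE_EXT = (".jpg", ".jpeg", ".png", ".gif")
VIDEO_EXT = (".mp4", ".avi", ".mov", ".flv")


def extract_tags(content):
    """Return a sorted list of tags describing what the text contains."""
    tags = set()
    if EMAIL_RE.search(content):
        tags.add("email")
    for url in URL_RE.findall(content):
        tags.add("link")
        # Classify the link by the file extension of its path.
        path = urlparse(url).path.lower()
        if path.endswith(IMAGE_EXT):
            tags.add("link_image")
        elif path.endswith(VIDEO_EXT):
            tags.add("link_video")
        else:
            tags.add("link_other")
    return sorted(tags)


# The document as it would be sent to Solr, with the tags in their own field.
doc = {"id": "1", "content": "jose@example.com http://www.site.com"}
doc["tags"] = extract_tags(doc["content"])
print(doc["tags"])
```

The "tags" field would then be sent to Solr along with "content", so that the tags are indexed as separate values.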

Re: Faceting

Jan Høydahl / Cominvent
NOTE: Please start a new email thread for a new topic (See http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking)

Your strategy could work. You might want to look into dedicated entity extraction frameworks like
http://opennlp.sourceforge.net/
http://nlp.stanford.edu/software/CRF-NER.shtml
http://incubator.apache.org/uima/index.html

Or if that is too much work, look at http://issues.apache.org/jira/browse/SOLR-1725 for a way to plug your entity extraction code into Solr itself using a scripting language.

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 5. feb. 2010, at 20.10, José Moreira wrote:

> [...]


Re: Faceting

hossman

: NOTE: Please start a new email thread for a new topic (See
: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking)

FWIW: I'm the most nit-picky person I know about thread hijacking, but I
don't see any MIME headers to indicate that Jose did that.

: > If i follow this path can i then facet on "email" and/or "link" ? For
: > example combining facet field with facet value params?

Any indexed field can be faceted on. It's hard to be sure exactly what
your goal is, but if you ultimately want a list of search results along
with facet info like "Number of results containing an email address" and
"Number of results containing a URL", then yes: as long as you have a way
of extracting that metadata and including it in an indexed field, you can
facet on it. You could use Field Faceting on something like a
"properties" field (where the indexed values are "contains_email",
"contains_url", etc.), or you could use facet queries to check arbitrary
criteria (e.g. facet.query=has_email:true & facet.query=urls:[* TO *],
etc.).



-Hoss
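
As a rough illustration of the two options in this reply, the sketch below builds the request URLs for field faceting and for facet queries. The host, port and path in SOLR_SELECT and the query "content:solr" are assumptions for the example. The field names "properties", "has_email" and "urls" come from the reply, and q, facet, facet.field and facet.query are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust host, port and path to your setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def facet_url(query, fields=(), queries=()):
    """Build a select URL that asks Solr for facet counts with the results."""
    params = [("q", query), ("facet", "true")]
    # One facet.field parameter per field whose values should be counted.
    params += [("facet.field", name) for name in fields]
    # One facet.query parameter per arbitrary criterion to count.
    params += [("facet.query", fq) for fq in queries]
    return SOLR_SELECT + "?" + urlencode(params)


# Option 1: field faceting on a "properties" field.
field_url = facet_url("content:solr", fields=["properties"])

# Option 2: facet queries that check arbitrary criteria.
query_url = facet_url(
    "content:solr",
    queries=["has_email:true", "urls:[* TO *]"],
)

print(field_url)
print(query_url)
```

Passing the parameters as a list of pairs lets facet.query repeat, which a plain dict would not allow.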

Re: Faceting

Jan Høydahl / Cominvent
Regarding hijacking, that was a false alarm. Apple Mail fooled me into believing it was part of another thread. Sorry Jose.

I think the "properties" field approach is clean. It relies on index-time classification, which is where such heavy lifting should preferably be done. Faceting on a multi-valued string field should work very well for this.

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com
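
To make the multi-valued "properties" idea concrete, here is a small plain-Python imitation of what field faceting reports: for each value, the number of matching documents that carry it. This is not Solr code, and the three sample documents are invented for the example.

```python
from collections import Counter

# Invented sample documents; "properties" holds zero or more values each.
docs = [
    {"id": "1", "properties": ["contains_email", "contains_url"]},
    {"id": "2", "properties": ["contains_url"]},
    {"id": "3", "properties": []},
]


def facet_counts(matching_docs, field):
    """Count, per value, how many documents have that value in the field."""
    counts = Counter()
    for doc in matching_docs:
        # set() so a value repeated inside one document is counted once.
        counts.update(set(doc.get(field, [])))
    return dict(counts)


print(facet_counts(docs, "properties"))
```

Because the classification happens once at index time, the query side only has to count values that are already stored.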

On 11. feb. 2010, at 01.47, Chris Hostetter wrote:

> [...]


Re: Faceting

Otis Gospodnetic-2
In reply to this post by Jan Høydahl / Cominvent
Note that UIMA doesn't do NER itself (as far as I know), but instead relies on GATE or OpenNLP or OpenCalais :)

Those interested in UIMA and living close to New York should go to http://www.meetup.com/NYC-Search-and-Discovery/calendar/12384559/


Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



----- Original Message ----

> From: Jan Høydahl / Cominvent <[hidden email]>
> To: [hidden email]
> Sent: Tue, February 9, 2010 9:57:26 AM
> Subject: Re: Faceting
>
> [...]


Re: Faceting

José Moreira-3
Have you used UIMA? I did a quick read of the docs and it seems to do what
I'm looking for.

2010/2/11 Otis Gospodnetic <[hidden email]>

> [...]


--
[hidden email]
http://pt.linkedin.com/in/josemoreira
http://djangopeople.net/josemoreira/

Re: Faceting

Lance Norskog-2
There are several component libraries for UIMA on the net:
http://incubator.apache.org/uima/external-resources.html

2010/2/18 José Moreira <[hidden email]>:

> [...]



--
Lance Norskog
[hidden email]