Solr Newbie question: doubts about how to handle html content


Marcio Pinto Motta
Solr Newbie question: doubts about html content



My "current" problem is finding the best approach to handle content that
contains HTML code.

I have some docs that may or may not contain HTML tags.

In my first attempt, I defined a field "text" in my schema.xml:



  <field name="text" type="text" indexed="true" stored="true"/>
<field name="texto"> <br><p>   A Brasil Telecom … <br/><br/><br/>]]></field>


But some docs that contain HTML code threw an error when I tried to send them
to Solr.

In my second attempt, I wrapped the content in CDATA, as in "<![CDATA[<br><p>   A Brasil Telecom …
<br/><br/><br/>]]>", and I could send the docs to Solr. I could also
search for "<br>" and retrieve the doc.

But looking at the source of the result page, as you can see,

<str name="text">

&lt;br&gt;&lt;p&gt;  A Brasil Telecom ... </str>

the html code was "changed".





My third approach is to create 2 fields in my schema:

- One with the original content

- One with no HTML code, which will be indexed.

But I don't know how to preserve this HTML content in my new field. My
question is:

How can I put these docs in Solr, search them, and retrieve the original <html>
content?

Thanks for your attention.



BR,



Marcio

Re: Solr Newbie question: doubts about how to handle html content

Erik Hatcher

On Oct 5, 2006, at 7:17 AM, Marcio Pinto Motta wrote:

> My "current" problem is to know the best approach to handle content  
> which
> have html code.
>
>
>
> I have some docs that may or may not have html tag.
>
>
>
> My first attempt, I defined a field "text" in my schema.xml :
>
>
>
>  <field name="text" type="text" indexed="true" stored="true"/>
> <field name="texto"> <br><p>   A Brasil Telecom … <br/><br/><br/>]]
> ></field>
>
>
> But some docs that have html code throw an error when I tried to  
> send them
> to Solr.

You must use CDATA or encode entities that have special meaning in  
XML.  I assume you're building the XML to POST to Solr as simply a  
string.  You definitely need to take encoding into consideration to  
avoid invalid XML.  I suspect whatever language you're communicating  
to Solr with has reasonable XML utilities you can leverage.
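
To illustrate the encoding point, here is a minimal Java sketch that escapes the XML-special characters before building the update message as a string. The names `SolrXmlEscape`, `escapeXml` and `addDoc` are made up for this example, and it assumes the input contains no characters that are illegal in XML 1.0 (such as most control characters), which it does not filter.

```java
final class SolrXmlEscape {
    /** Escapes the characters that have special meaning in XML text and attribute values. */
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds an <add> message holding one document with a single field. */
    static String addDoc(String fieldName, String value) {
        return "<add><doc><field name=\"" + escapeXml(fieldName) + "\">"
                + escapeXml(value) + "</field></doc></add>";
    }

    public static void main(String[] args) {
        // The HTML markup ends up as entities, so the message stays well-formed XML.
        System.out.println(addDoc("text", "<br><p>A Brasil Telecom</p>"));
    }
}
```

A real XML library (for example a StAX `XMLStreamWriter`) performs the same escaping for you and is usually the safer choice than string concatenation.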

> My second attempt, I put "<![CDATA[<br><p>   A Brasil Telecom …
> <br/><br/><br/>]]>" and I could send the docs to Solr, and,  I  
> could make a
> search for "<br>" and retrieve the doc.
>
>
>
> But consulting the result page source,  as you can see,
>
> <str name="text">
>
> &lt;br&gt;&lt;p&gt;  A Brasil Telecom ... </str>
>
> the html code was "changed".

It wasn't "changed" per se... but rather it was encoded.  If you use  
an XML API to read the response you would not see these encoded  
characters.
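
As a small illustration of that, the sketch below parses a response fragment with the JDK's DOM parser and reads the field back. The response string is a cut-down, hypothetical stand-in for a real Solr response, and `ReadStoredField` and `firstStr` are names invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ReadStoredField {
    /** Returns the decoded text of the first <str> element with the given name, or null if none. */
    static String firstStr(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList strs = doc.getElementsByTagName("str");
            for (int i = 0; i < strs.getLength(); i++) {
                Element e = (Element) strs.item(i);
                if (name.equals(e.getAttribute("name"))) {
                    // The parser has already turned &lt; and &gt; back into < and >.
                    return e.getTextContent();
                }
            }
            return null;
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse response XML", ex);
        }
    }

    public static void main(String[] args) {
        String response =
                "<doc><str name=\"text\">&lt;br&gt;&lt;p&gt;  A Brasil Telecom</str></doc>";
        System.out.println(firstStr(response, "text"));
    }
}
```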

> . One with original content
>
> . One with no html code, which will be indexed.
>
>
>
> But I don't know how to preserve this html content in my new field. My
> question is:
>
> How to put these docs in Solr, search them, and retrieve de  
> original <html>
> content.

What are your searching needs?  Are you really going to be searching  
on "<br>"?   If so, you need to consider the analysis of the text  
sent to Solr carefully (look at the admin page analysis utility for  
insight).  Regardless of what gets indexed, you can always store and  
retrieve the original text as long as the field is marked as stored.

        Erik


Re: Solr Newbie question: doubts about how to handle html content

PanosJee
I think that is not the best approach you can have...
And there is no need to index the code, since searching it gives no useful
results... Personally I would index the pure text and keep the code plus an
id in a database, so my db would look like, let's say,
id  text  text+code
and I would send id + text to Lucene.

From within the webapp (or whatever page), Solr would just return an
id that I would use to retrieve the code with an XmlHttpRequest...
(Ajax style)


Re: Solr Newbie question: doubts about how to handle html content

Yonik Seeley-2
In reply to this post by Erik Hatcher
On 10/5/06, Erik Hatcher <[hidden email]> wrote:
> On Oct 5, 2006, at 7:17 AM, Marcio Pinto Motta wrote:
> > &lt;br&gt;&lt;p&gt;  A Brasil Telecom ... </str>
> >
> > the html code was "changed".
>
> It wasn't "changed" per se... but rather it was encoded.  If you use
> an XML API to read the response you would not see these encoded
> characters.

You can also use a different output syntax to verify that the internal
form is unchanged...
for example, add a wt=json to the HTTP parameters to see the results
in JSON format.

See HTMLStripWhitespaceTokenizerFactory if you don't want XML/HTML
tags indexed.  As Erik said, regardless of how you analyze a field,
you can always get an un-analyzed version back when you mark the field
as "stored".
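
To make the stored/indexed distinction concrete, here is a toy Java model; it is not Solr's implementation. The stored value is the original string, untouched, while the tokens that searches match against are derived from a tag-stripped copy. The regex-based `stripTags` is a crude assumption standing in for the HTML-stripping tokenizer and will mishandle comments, scripts and entities; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class StoredVsIndexed {
    /** Crude tag removal for illustration only; real HTML needs a proper parser. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    /** Whitespace-separated tokens of the tag-free text: what a search would match against. */
    static List<String> indexedTokens(String html) {
        List<String> tokens = new ArrayList<>();
        for (String t : stripTags(html).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String stored = "<br><p>A Brasil Telecom</p>";
        // The stored value is returned as-is; only the indexed tokens lose the markup.
        System.out.println("stored:  " + stored);
        System.out.println("indexed: " + indexedTokens(stored));
    }
}
```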

-Yonik

Re: Solr Newbie question: doubts about how to handle html content

Marcio Pinto Motta
On 10/5/06, Yonik Seeley <[hidden email]> wrote:

>
> On 10/5/06, Erik Hatcher <[hidden email]> wrote:
> > On Oct 5, 2006, at 7:17 AM, Marcio Pinto Motta wrote:
> > > &lt;br&gt;&lt;p&gt;  A Brasil Telecom ... </str>
> > >
> > > the html code was "changed".
> >
> > It wasn't "changed" per se... but rather it was encoded.  If you use
> > an XML API to read the response you would not see these encoded
> > characters.
>
> You can also use a different output syntax to verify that the internal
> form is unchanged...
> for example, add a wt=json to the HTTP parameters to see the results
> in JSON format.
>
> See HTMLStripWhitespaceTokenizerFactory if you don't want XML/HTML
> tags indexed.  As Erik said, regardless of how you analyze a field,
> you can always get an un-analyzed version back when you mark the field
> as "stored".
>
> -Yonik
>


Hi folks,



What I want is to avoid a database server as much as possible. I don't want
to allow "<>" searches, but it is vital to retrieve the "text" with its HTML content.
I also need the content ready to be shown as soon as possible.
Approaches like solr.HTMLStripWhitespaceTokenizerFactory and JSON output in Solr
are amazing and very productive (they save a lot of code that would otherwise
have to be written). The more I test, the more amazed I become, and I haven't
tested replication yet (which is my main goal) :)

Thanks a lot for all the responses (very quick, too) :)



BR,



Marcio

Sorting

maustin
I need to sort a query two ways. Should I do the search one way:
s.getDocListAndSet(query, restrictions, sort, req.getStart(),
req.getLimit(), flags);
then do the same search again with a different sort value, or is there a
method available to just sort the DocSet (like sortDocSet, but it's
protected)?

OR maybe it doesn't  matter because caching will handle it anyway?

Thanks


Re: Sorting

Chris Hostetter-3

: I need to sort a query two ways. Should I do the search one way:
: s.getDocListAndSet(query, restrictions, sort, req.getStart(),
: req.getLimit(), flags);
: then do the same search again with a different sort value or is there a
: method available to just sort the DocSet (like sortDocSet but it's
: protected)
:
: OR maybe it doesn't  matter because caching will handle it anyway?

check this out from the example solrconfig.xml...

   <!-- An optimization that attempts to use a filter to satisfy a search.
         If the requested sort does not include score, then the filterCache
         will be checked for a filter matching the query. If found, the filter
         will be used as the source of document ids, and then the sort will be
         applied to that.  -->
    <useFilterForSortedQuery>true</useFilterForSortedQuery>

...in those conditions, you should be able to just call getDocList (or
getDocListAndSet) with your various Sort options and the cache will take
care of everything.

if you *do* want scores to be included in one of the Sorts, then i would
try doing that search first using getDocListAndSet -- you can ignore the
DocSet, but the next call to getDocList should leverage the filterCache,
and the initial getDocListAndSet call should be faster than two separate
getDocList calls with different sorts...

                                        ...i think.




-Hoss


Re: Sorting

maustin
Let me back up for a second. I want to create price ranges. I was thinking
that I would do a search with a sort on price and create ranges by getting
the document price every (docCount / #ofpricerangesIwant). Basically create:
< 10, 10 - 60, 60 - 100 etc.. If the initial search wasn't sorted by price
then I would have to do the second search just to figure out the price
ranges.

This was the only way I could think to do it. Maybe I'm going at this the
wrong way?
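
The idea described above could be sketched roughly like this in Java, assuming the prices of the matching documents are already available as an array sorted ascending (how they are fetched from the searcher is left out). `PriceRanges` and `cutPoints` are hypothetical names.

```java
final class PriceRanges {
    /**
     * Picks bucketCount - 1 cut points from prices sorted ascending by reading
     * the price at every (docCount / bucketCount)-th position.
     */
    static double[] cutPoints(double[] sortedPrices, int bucketCount) {
        if (bucketCount < 1) {
            throw new IllegalArgumentException("bucketCount must be at least 1");
        }
        int n = sortedPrices.length;
        if (bucketCount == 1 || n == 0) {
            return new double[0]; // a single bucket (or no documents) needs no cut points
        }
        double[] cuts = new double[bucketCount - 1];
        for (int i = 1; i < bucketCount; i++) {
            // i < bucketCount guarantees the index stays below n
            int idx = (int) ((long) i * n / bucketCount);
            cuts[i - 1] = sortedPrices[idx];
        }
        return cuts;
    }

    public static void main(String[] args) {
        double[] prices = {4.99, 7.5, 12, 31.2, 45, 59.99, 72.3, 85, 99};
        // Three buckets: below cuts[0], cuts[0] up to cuts[1], and cuts[1] and above.
        System.out.println(java.util.Arrays.toString(cutPoints(prices, 3)));
    }
}
```

Note that the cut points land wherever the data puts them, and that with fewer documents than buckets the same price can come back more than once.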

Thanks

On 10/11/06, Chris Hostetter <[hidden email]> wrote:

>
>
> : I need to sort a query two ways. Should I do the search one way:
> : s.getDocListAndSet(query, restrictions, sort, req.getStart(),
> : req.getLimit(), flags);
> : then do the same search again with a different sort value or is there a
> : method available to just sort the DocSet (like sortDocSet but it's
> : protected)
> :
> : OR maybe it doesn't  matter because caching will handle it anyway?
>
> check this out from the example solrconfig.xml...
>
>   <!-- An optimization that attempts to use a filter to satisfy a search.
>         If the requested sort does not include score, then the filterCache
>         will be checked for a filter matching the query. If found, the
> filter
>         will be used as the source of document ids, and then the sort will
> be
>         applied to that.  -->
>    <useFilterForSortedQuery>true</useFilterForSortedQuery>
>
> ...in those conditions, you should be able to just call getDocList (or
> getDocListAndSet) with your various Sort options and the cache will take
> care of everything.
>
> if you *do* want scores to be included in one of the Sorts, then i would
> try doing that search first using getDocListAndSet -- you can ignore the
> DocSet, but the next call to getDocList should leverage the filterCache,
> and the initial getDocListAndSet call should be faster than two separate
> getDocList calls with different sorts...
>
>                                        ...i think.
>
>
>
>
> -Hoss
>
>

Re: Sorting

Chris Hostetter-3

: Let me back up.. for a second. I want to create price ranges. I was thinking
: that I would do a search with a sort on price and create ranges by getting
: the document price every (docCount / #ofpricerangesIwant). Basically create:
: < 10, 10 - 60, 60 - 100 etc.. If the initial search wasn't sorted by price
: then I would have to do the second search just to figure out the price
: ranges.
:
: This was the only way I could think to do it. Maybe I'm going at this the
: wrong way?

that's certainly one way to do it ... it would probably be faster though
to use the TermEnum of the price field directly.

but bear in mind that getting ranges that look "clean"  is probably harder
than you realize.  i've yet to really see a good approach to programmatically
determining (non-trivial) numeric ranges, personally i think that to have
good looking ranges you pretty much have to have them picked by a person
and stored in metadata.  i had some comments on this in a discussion a
little while back...

http://www.nabble.com/forum/ViewPost.jtp?post=3753053&framed=y


: On 10/11/06, Chris Hostetter <[hidden email]> wrote:
: >
: >
: > : I need to sort a query two ways. Should I do the search one way:
: > : s.getDocListAndSet(query, restrictions, sort, req.getStart(),
: > : req.getLimit(), flags);
: > : then do the same search again with a different sort value or is there a
: > : method available to just sort the DocSet (like sortDocSet but it's
: > : protected)
: > :
: > : OR maybe it doesn't  matter because caching will handle it anyway?
: >
: > check this out from the example solrconfig.xml...
: >
: >   <!-- An optimization that attempts to use a filter to satisfy a search.
: >         If the requested sort does not include score, then the filterCache
: >         will be checked for a filter matching the query. If found, the
: > filter
: >         will be used as the source of document ids, and then the sort will
: > be
: >         applied to that.  -->
: >    <useFilterForSortedQuery>true</useFilterForSortedQuery>
: >
: > ...in those conditions, you should be able to just call getDocList (or
: > getDocListAndSet) with your various Sort options and the cache will take
: > care of everything.
: >
: > if you *do* want scores to be included in one of the Sorts, then i would
: > try doing that search first using getDocListAndSet -- you can ignore the
: > DocSet, but the next call to getDocList should leverage the filterCache,
: > and the initial getDocListAndSet call should be faster than two separate
: > getDocList calls with different sorts...
: >
: >                                        ...i think.
: >
: >
: >
: >
: > -Hoss
: >
: >
:



-Hoss


Re: Sorting

maustin
>
> that's certainly one way to do it ... it would probably be faster though
> to use the TermEnum of the price field directly.


I will look into this.


> i've yet to really see a good approach to programmatically
> determining (non-trivial) numeric ranges, personally i think that to have
> good looking ranges you pretty much have to have them picked by a person
> and stored in metadata.


I have thought about just doing it this way also. I originally did it this
way but it would be nice to have different buckets depending on the result
set that you have in the category
vs.
using the same buckets for any result set in a category even though they
don't make as much sense anymore depending on the facets selected.

> i had some comments on this in discussion a
> little while back...
>
> http://www.nabble.com/forum/ViewPost.jtp?post=3753053&framed=y


I will have to read through this again but it looks like a good discussion.

Thanks.

Re: Sorting

Chris Hostetter-3

: way but it would be nice to have different buckets depending on the result
: set that you have in the category
: vs.
: using the same buckets for any result set in a category even though they
: don't make as much sense anymore depending on the facets selected.

that's a very slippery slope .. i suspect a lot of users would be put-off
by ranges that kept changing on them as they added/removed other facet
constraints -- one minute you are seeing ranges like 0-10,11-20,21-30 and
then you say you are only interested in red products and now your ranges
are 0-12,13-20,21-30 ... ? ... that might be a little confusing.

especially if a user wants to try and come back to your site and recreate
a previous experience:  they may not remember that they first selected
"color" then "price" and then "size" ... if they come back and pick the
color, and the size you may now offer them completely different "price"
choices than they saw before, and they won't be able to get to the page
they were looking at before.



-Hoss


Re: Sorting

maustin
>
> that's a very slippery slope .. i suspect a lot of users would be put-off
> by ranges that kept changing on them as they added/removed other facet
> constraints -- one minute you are seeing ranges like 0-10,11-20,21-30 and
> then you say you are only interested in red products and now your ranges
> are 0-12,13-20,21-30 ... ? ... that might be a little confusing.


Very good point.



> picked by a person and stored in metadata
>

This is my current design. I think I will stick with it after your
comments, thinking about it more, and the trouble I'm having getting the
automatic ranges just right. Plus, generated ranges are inflexible: they would not
let me specify the labels in different ways like I can now. I was trying to make it as
automated as possible, since I have limited human resources. However, I
did make it easy to update and regenerate my facet config files, so it
isn't bad now, and the ability to use my own labels for certain
buckets is nice.
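
For reference, the hand-picked approach can be sketched as follows. The bucket bounds reuse the "< 10, 10 - 60, 60 - 100" example from earlier in the thread; the class names, the labels and the in-code bucket list (standing in for the facet config files) are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FixedPriceFacets {
    /** One hand-picked range: min is inclusive, max is exclusive. */
    static final class Bucket {
        final String label;
        final double min;
        final double max;

        Bucket(String label, double min, double max) {
            this.label = label;
            this.min = min;
            this.max = max;
        }

        boolean contains(double price) {
            return price >= min && price < max;
        }
    }

    /** Counts prices per bucket, keeping the configured bucket order. */
    static Map<String, Integer> count(Bucket[] buckets, double[] prices) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Bucket b : buckets) {
            counts.put(b.label, 0);
        }
        for (double price : prices) {
            for (Bucket b : buckets) {
                if (b.contains(price)) {
                    counts.merge(b.label, 1, Integer::sum);
                    break; // buckets do not overlap, so stop at the first match
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Bucket[] buckets = {
            new Bucket("Under $10", 0, 10),
            new Bucket("$10 to $60", 10, 60),
            new Bucket("$60 to $100", 60, 100)
        };
        double[] prices = {4.99, 12, 59.99, 60, 99, 150};
        // Prices outside every bucket (150 here) are simply not counted.
        System.out.println(count(buckets, prices));
    }
}
```

Because the bounds and labels are fixed, the same buckets appear no matter which other facet constraints are selected; only the counts change.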

Thanks