problem with solr.HTMLStripWhitespaceTokenizerFactory

mike topper
I'm trying to use the HTML-stripping tokenizer factory to strip HTML tags
from my description field when indexing.

I added this fieldtype:

    <fieldtype name="text_html" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
          <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true"/>
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
          <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true"/>
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>


And then in my schema I have this:

<field name="description" type="text_html" indexed="true" stored="true"/>



When I insert a document, nothing seems to happen, i.e. when I run a query,
here is the response for a test description:

<str name="description">

<br>hi<br>my<br>name<br>is<br>topper<br>and this <b>&nbsp;blahblah</b> is a <b>test</b>

</str>




Any ideas?

-Mike


Re: problem with solr.HTMLStripWhitespaceTokenizerFactory

Yonik Seeley-2
On 3/6/07, mike topper <[hidden email]> wrote:
> when inserting it it seems like nothing happens ie when i do a query
> here is the response for a test description:
>
> <str name="description">
>
> <br>hi<br>my<br>name<br>is<br>topper<br>and this <b>&nbsp;blahblah</b> is a <b>test</b>
>
> </str>

The tag stripping happens during the analysis phase, and affects what
gets indexed.
For returned field values, you get what you put in.

-Yonik
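
Since the stored value comes back exactly as it was submitted, one option is to strip the markup on the client before posting the document. What follows is only a rough sketch of that idea, not something proposed in this thread: the `strip_html` helper is hypothetical, it uses a crude regular expression rather than a real HTML parser, and it handles only the `&nbsp;` entity from the example above.

```shell
# Hypothetical helper (not part of Solr): crude client-side tag stripping.
strip_html() {
  # Replace each tag and each &nbsp; with a space, squeeze whitespace runs,
  # then trim the leading and trailing space.
  sed -e 's/<[^>]*>/ /g' -e 's/&nbsp;/ /g' \
    | tr -s '[:space:]' ' ' \
    | sed -e 's/^ //' -e 's/ $//'
}

# The test description from the original post.
raw='<br>hi<br>my<br>name<br>is<br>topper<br>and this <b>&nbsp;blahblah</b> is a <b>test</b>'
clean=$(printf '%s' "$raw" | strip_html)
echo "$clean"
```

The cleaned text would then be what gets sent as the field value, so the stored copy and the indexed tokens both come from markup-free input.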

Time after snapshot is "visible" on the slave

galo-2
Hi,

I've been testing index replication. After snappulling and installing
the latest version of the master index, if I run a query on the slave I
don't get any results back (I tried a commit in despair, which didn't work
either). If I restart the web server (Tomcat), then it works.

Am I missing any steps or just being too impatient sending queries?

Cheers

--
Galo Navarro, Developer

[hidden email]
t. +44 (0)20 7780 7080

Last.fm | http://www.last.fm
Karen House 1-11 Baches Street
London N1 6DL

http://www.last.fm/user/galeote


RE: Time after snapshot is "visible" on the slave

Graham Stead-2
Hi Galo,

The snapinstaller actually performs a commit as its last step, so if that
didn't work, it's not surprising that running commit separately didn't work,
either.

I would suggest running the snapinstaller and/or commit scripts with the -V
option. This will produce verbose debugging information and allow you to see
where they encounter problems.

Hope this helps,
-Graham



RE: Time after snapshot is "visible" on the slave

Graham Stead-2
I forgot to mention that the admin page (solr/admin/stats.jsp) is an
excellent way to see when the last searcher was opened. After running
commit, you should see the openedAt and registeredAt timestamps update,
e.g.:

openedAt : Tue Mar 06 08:14:19 PST 2007
registeredAt : Tue Mar 06 08:15:55 PST 2007

If you have added documents, you'll see numDocs and/or maxDoc change as well.

If you don't see these update, then something isn't right. If you see them
update but cannot find your documents in the index, then your indexing
process may not be working correctly.
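
A rough sketch of that before/after comparison, assuming the stats text has been saved as plain "name : value" lines like the ones quoted above (the real page markup may differ). The `searcher_times` helper is hypothetical, and the "after" timestamps are made-up sample values, not real output.

```shell
# Hypothetical helper: pull the searcher timestamps out of stats text on stdin.
# Assumes the "name : value" layout shown in this message.
searcher_times() {
  grep -E '(openedAt|registeredAt) :' | sed -e 's/^[[:space:]]*//'
}

# Snapshot taken before the commit (values quoted in this message).
before='openedAt : Tue Mar 06 08:14:19 PST 2007
registeredAt : Tue Mar 06 08:15:55 PST 2007'

# Snapshot taken after the commit (invented sample values for illustration).
after='openedAt : Tue Mar 06 09:02:41 PST 2007
registeredAt : Tue Mar 06 09:02:43 PST 2007'

if [ "$(printf '%s\n' "$before" | searcher_times)" != "$(printf '%s\n' "$after" | searcher_times)" ]; then
  status="new searcher opened"
else
  status="searcher unchanged"
fi
echo "$status"
```

If the two snapshots are identical after running snapinstaller or commit, the commit most likely never reached the Solr instance.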

Hope this helps,
-Graham

PS: If you are running replication with multiple solr instances, your
problem may be caused by a simple bug in the commit, optimize, and
readercycle scripts. Replace the /solr/ in the curl statement with
${webapp_name}:

From:
rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s -d "<commit/>"`

To:
rs=`curl http://${solr_hostname}:${solr_port}/${webapp_name}/update -s -d "<commit/>"`

I haven't had time to commit these bug fixes yet.
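
To make the From/To change concrete, here is a minimal sketch of how the update URL comes out once it is built from `${webapp_name}` instead of a hard-coded /solr/. The host, port, and webapp name values are hypothetical placeholders, and `send_commit` is an illustrative wrapper around the same curl call, not a function from the distributed scripts.

```shell
# Hypothetical values; substitute the ones from your own scripts.conf.
solr_hostname=localhost
solr_port=8080
webapp_name=solr-slave1

# Build the URL from the configured webapp name rather than a fixed /solr/.
update_url="http://${solr_hostname}:${solr_port}/${webapp_name}/update"
echo "$update_url"

# Illustrative wrapper around the curl call from the fixed script.
# It is only defined here, not invoked, so the sketch runs without a server.
send_commit() {
  curl "$update_url" -s -d "<commit/>"
}
```

With the hard-coded path, a webapp deployed under any name other than solr never receives the commit, which matches the symptom of new snapshots staying invisible until a restart.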



Re: Time after snapshot is "visible" on the slave

galo-2
In reply to this post by Graham Stead-2
Yep, the snapinstaller was failing. It was the same problem Jeff posted
about this morning for bin/optimize, but this time in bin/commit: the
script was not using ${webapp_name}.

I fixed that and it worked normally. I've submitted a bug to JIRA, as I
think Jeff hasn't submitted it yet.

Mm, now I see your other email... oh well.

Thanks for your help,

Graham Stead wrote:

> Hi Galo,
>
> The snapinstaller actually performs a commit as its last step, so if that
> didn't work, it's not surprising that running commit separately didn't work,
> either.
>
> I would suggest running the snapinstaller and/or commit scripts with the -V
> option. This will produce verbose debugging information and allow you to see
> where they encounter problems.
>
> Hope this helps,
> -Graham
>
>
>
>  

