How to not tokenize HTML tag from input string

Joe Tang
My task is to index keywords from a document. In my case, the document is made up of HTML, and I don't want the tags themselves to be indexed.

For example:
Input Document:
<div id="tp-wrapper">
<span id="tp-top-right">You are welcome</span>
<div id="tp-tab">
<h1>Testing text</h1>
/images/gui/tab_grey_bkg_lftend.gif
</div>
</div>

Expected Keywords:
keywords:You
keywords:are
keywords:welcome
keywords:Testing
keywords:text

Is there any way I can keep the HTML tags from ending up among the keywords?

Re: How to not tokenize HTML tag from input string

Robert Engels
Ask on the user list. Actually, search the archives first.
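
For readers landing on this thread: one common approach, not taken from this thread, is to strip the markup before the text reaches the tokenizer. Below is a minimal sketch using only the Java standard library. The class name `HtmlKeywords` and the method `extract` are hypothetical, and the regex-based tag removal is an assumption that only suits simple, well-formed markup like the example. It does not decode entities such as `&amp;`, and it does not handle comments or script blocks. The sample input is modeled on the quoted example, minus the image path line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: strips tags, then splits the remaining text into keywords.
final class HtmlKeywords {
    // Matches anything that looks like a tag, e.g. <div id="tp-tab"> or </div>.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Matches runs of characters that are neither letters nor digits.
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{N}]+");

    static List<String> extract(String html) {
        // Replace each tag with a space so words on either side do not merge.
        String text = TAG.matcher(html).replaceAll(" ");
        List<String> keywords = new ArrayList<>();
        for (String token : NON_WORD.split(text)) {
            // split() can yield an empty leading token; skip it.
            if (!token.isEmpty()) {
                keywords.add(token);
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        String html = "<div id=\"tp-wrapper\">\n"
                + "<span id=\"tp-top-right\">You are welcome</span>\n"
                + "<div id=\"tp-tab\">\n"
                + "<h1>Testing text</h1>\n"
                + "</div>\n"
                + "</div>";
        // Prints one "keywords:<token>" line per extracted word.
        for (String keyword : extract(html)) {
            System.out.println("keywords:" + keyword);
        }
    }
}
```

The resulting list can then be passed to whatever indexing code is in use. For real-world HTML, a proper HTML parser is likely a safer choice than a regular expression.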

On Feb 7, 2007, at 6:10 PM, Joe Tang wrote:

>
> My work is to index keywords with a document. In my case, the document is
> made up with HTML tags which i don't want to index them.
>
> For example:
> Input Document:
> <div id="tp-wrapper">
> <span id="tp-top-right">You are welcome</span>
> <div id="tp-tab">
> <h1>Testing text</h1>
> /images/gui/tab_grey_bkg_lftend.gif
> </div>
> </div>
>
> Expected Keywords:
> keywords:You
> keywords:are
> keywords:welcome
> keywords:Testing
> keywords:text
>
> Is there anyway I can make them not to be one of the keywords?
> --
> View this message in context: http://www.nabble.com/How-to-not-tokenize-HTML-tag-from-input-string-tf3190611.html#a8857238
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>

