Indexing a word in url

Vinci
Hi all,

I would like to ask: if I want to index the words in a URL, which data type and parser should I use?

Thank you,
Vinci

Re: Indexing a word in url

Mike Klaas

On 31-Mar-08, at 10:50 AM, Vinci wrote:
>
> Hi all,
>
> I would like to ask, if I want to index word in a URL, which data type and
> parser should I use?

Depends on how you want to search it. I use WordDelimiterFilter with
only parts generation enabled (no catenation), and an additional stopword
list that excludes a few tokens like 'http'.

-Mike

Re: Indexing a word in url

Vinci
Hi,

Thank you for your reply.
Actually, I want anything that is not a letter or a digit to act as a separator, so that everything between separators becomes a word (that way I can use a URL fragment to see what is indexed about this site). Any suggestions?

Thank you,
Vinci


Re: Indexing a word in url

hossman

: Actually I want to use anything that is not alphabet or digit to be the
: separator - anything between them will be a word (so that I can use the URL
: fragment to see what is indexed about this site)...any suggestion?

In addition to Mike's suggestion of trying out the WordDelimiterFilter,
take a look at the PatternTokenizerFactory.



-Hoss
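
For readers who want to see the splitting rule itself, here is a minimal sketch in plain Java. It is not Solr code and not what anyone in this thread posted: it only illustrates the regex "any run of non-alphanumeric characters is a separator", which is the kind of expression PatternTokenizerFactory takes through its pattern attribute. The class name, the lowercasing step, and every stop token other than 'http' are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class UrlWordSplitter {
    // Hypothetical stop list: 'http' is mentioned in the thread, the others are assumed.
    private static final Set<String> STOP_TOKENS = Set.of("http", "https", "www");

    /** Splits a URL on every run of characters that is not an ASCII letter or digit. */
    public static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        for (String part : url.split("[^A-Za-z0-9]+")) {
            // Lowercasing is an assumption; it makes stop-token matching case-insensitive.
            String token = part.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_TOKENS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [example, com, solr, faq, index, html, lang, en]
        System.out.println(tokenize("http://www.example.com/solr-faq/index.html?lang=en"));
    }
}
```

Whether the same expression behaves identically inside a Solr analysis chain depends on the Solr version and the rest of the field type, so it is worth checking the output on the analysis admin page.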

Re: Indexing a word in url

Simon Rosenthal
I also couldn't get the exact results I wanted for indexing URL components
using WordDelimiterFilter or PatternTokenizer, so I resorted to adding a new
field ('pathparts'), plus a few lines of code to generate the tokens in our
content preprocessor, which submits documents to Solr for indexing.

-Simon
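
Simon's preprocessor code is not shown in the thread, so the following is only a guess at what "a few lines of code" for a 'pathparts' field might look like. It assumes the tokens come from the path portion of the URL only and that non-alphanumeric characters separate them; the class and method names are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class PathPartsGenerator {
    /** Returns the alphanumeric pieces of the URL's path, in order of appearance. */
    public static List<String> pathParts(String url) {
        List<String> parts = new ArrayList<>();
        // URI.getPath() returns null for opaque URIs such as mailto: addresses.
        String path = URI.create(url).getPath();
        if (path == null) {
            return parts;
        }
        for (String part : path.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Prints [products, solr, guide, index, html]
        System.out.println(pathParts("http://www.example.com/products/solr-guide/index.html"));
    }
}
```

The resulting tokens could then be joined with spaces and sent as the value of the extra field, leaving the index-time analyzer for that field as simple as a whitespace tokenizer.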

On Tue, Apr 1, 2008 at 7:24 PM, Chris Hostetter <[hidden email]>
wrote:

> : Actually I want to use anything that is not alphabet or digit to be the
> : separator - anything between them will be a word (so that I can use the
> : URL fragment to see what is indexed about this site)...any suggestion?
>
> In addition to Mike's suggestion of trying out the WordDelimiterFilter,
> take a look at the PatternTokenizerFactory.
>
> -Hoss