Searching across spaces


Searching across spaces

Robert Young-5
Hi,

How can I search across spaces in the document when the spaces aren't
present in the search query? For example, if the document contains
"spongebob squarepants" but the user searches for "sponge bob", I would
like to get the result.

Thanks
Rob


Re: Searching across spaces

Erick Erickson
I suspect you have to do some fancy indexing. That is, index the following
terms: sponge, bob, square, pants, spongebob, squarepants.

But this requires that you understand all the variations you want to hit on
ahead of time.

Or, you could conceivably deal with wildcard queries, but I think this is
the same problem as indexing the many different terms.

Note that "Lucene in Action" has a section on indexing synonyms that could
be very helpful to you if you decide to index several terms, particularly if
you want span queries to operate in this space.

Best
Erick
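
A minimal sketch of the synonym-injection idea Erick refers to (the Lucene in Action approach of stacking extra terms by position), written against the Lucene 1.9/2.0-era TokenStream API that was current when this thread was posted; the class name SimpleSynonymFilter and the split map are made up for illustration:

import java.io.IOException;
import java.util.LinkedList;
import java.util.Map;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// For tokens listed in the map (e.g. "spongebob" -> {"sponge", "bob"}),
// also emits the split parts: the first part stacked on the original
// token's position, later parts one position further on, so the phrase
// query "sponge bob" lines up against an indexed "spongebob".
public class SimpleSynonymFilter extends TokenFilter {
  private final Map splits;                       // String -> String[]
  private final LinkedList pending = new LinkedList();

  public SimpleSynonymFilter(TokenStream in, Map splits) {
    super(in);
    this.splits = splits;
  }

  public Token next() throws IOException {
    if (!pending.isEmpty()) {
      return (Token) pending.removeFirst();
    }
    Token token = input.next();
    if (token == null) {
      return null;
    }
    String[] parts = (String[]) splits.get(token.termText());
    if (parts != null) {
      for (int i = 0; i < parts.length; i++) {
        Token part = new Token(parts[i], token.startOffset(), token.endOffset());
        part.setPositionIncrement(i == 0 ? 0 : 1);
        pending.addLast(part);
      }
      // Caveat: the extra positions mean an exact phrase over the original
      // tokens ("spongebob squarepants") would now need a little slop.
    }
    return token;
  }
}

At index time you would wrap your normal tokenizer with this filter; as Erick says, it only helps for variations you know about ahead of time.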

Re: Searching across spaces

Robert Young-5
Yes, I looked at the synonym solution from Lucene in Action but, as
you point out, I have to know about the variations ahead of time. The only
solution I've come up with so far is to index the term without the spaces as
well and then run two searches, one with spaces and one without. It
would work, but it seems like quite a ratty solution.
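
One way to express the workaround Rob describes as a single query instead of two separate searches, assuming the field (called "content" here for illustration) is tokenized on whitespace so the document term "spongebob" is already in the index; this uses the mutable BooleanQuery/PhraseQuery API of that era (newer Lucene builds these through Builder classes):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Matches either the query as typed (as a phrase) or the query with its
// spaces removed, so "sponge bob" also finds the indexed term "spongebob".
public class SpacedOrSquashedQuery {

  static Query build(String field, String userQuery) {
    String lower = userQuery.toLowerCase();
    BooleanQuery combined = new BooleanQuery();

    // Clause 1: the words as typed, as a phrase.
    PhraseQuery asTyped = new PhraseQuery();
    String[] words = lower.split("\\s+");
    for (int i = 0; i < words.length; i++) {
      asTyped.add(new Term(field, words[i]));
    }
    combined.add(asTyped, BooleanClause.Occur.SHOULD);

    // Clause 2: the same words with the spaces squashed out.
    combined.add(new TermQuery(new Term(field, lower.replaceAll("\\s+", ""))),
                 BooleanClause.Occur.SHOULD);

    return combined;
  }
}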


RE: Searching across spaces

Eric Isakson
In reply to this post by Robert Young-5
You might consider using overlapping bi-gram tokenization, with the whitespace stripped out, together with a PhraseQuery.

So your tokenized content, "spongebob squarepants", would look like:

sp po on ng ge eb bo ob bs sq qu ua ar re ep pa an nt ts

and your tokens for your query, "sponge bob", would look like

sp po on ng ge eb bo ob

Add each token to the PhraseQuery and you should match.

This is very similar to the techniques used for searching in Asian languages, which do not separate words with spaces. There are probably some side effects for compound words that you didn't intend to treat this way, but without knowing the exact domain of compound words you wish to support, this is probably the best you will be able to do.
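
A minimal sketch of the query side of Eric's suggestion, assuming the field (here called "content") was indexed with the same whitespace-stripped bi-grams; plain Java plus Lucene's Term and PhraseQuery, using the mutable PhraseQuery API of that era (newer versions use PhraseQuery.Builder):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class BigramPhraseQueryBuilder {

  // Strip whitespace and return overlapping character bi-grams,
  // e.g. "sponge bob" -> sp po on ng ge eb bo ob
  static String[] bigrams(String text) {
    String squashed = text.toLowerCase().replaceAll("\\s+", "");
    if (squashed.length() < 2) {
      return new String[] { squashed };
    }
    String[] grams = new String[squashed.length() - 1];
    for (int i = 0; i < grams.length; i++) {
      grams[i] = squashed.substring(i, i + 2);
    }
    return grams;
  }

  // Build a PhraseQuery over the bi-grams of the user's query string.
  // The field must have been indexed with the same bi-gram tokenization.
  static PhraseQuery build(String field, String userQuery) {
    PhraseQuery pq = new PhraseQuery();
    String[] grams = bigrams(userQuery);
    for (int i = 0; i < grams.length; i++) {
      pq.add(new Term(field, grams[i]));
    }
    return pq;
  }

  public static void main(String[] args) {
    // Prints roughly: content:"sp po on ng ge eb bo ob"
    System.out.println(build("content", "sponge bob"));
  }
}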



Re: Searching across spaces

Maxym Mykhalchuk-2
Eric,

IMHO the number of side-effects can be reduced by requiring "phrases" for
each original word: the tokens for your query, "sponge bob", would then look like
"sp po on ng ge" eb "bo ob"

Maxym

==================================
Maxym Mykhalchuk
(+39) 320 8593170
PhD student at University of Trento, ITALY
==================================
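
A sketch of Maxym's refinement under the same assumptions as above: one required phrase of bi-grams per original query word, combined in a BooleanQuery, with the cross-word bi-gram ("eb") simply dropped; class and field names are again made up for illustration:

import java.util.StringTokenizer;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

// e.g. build("content", "sponge bob") -> +content:"sp po on ng ge" +content:"bo ob"
public class PerWordBigramQuery {

  static Query build(String field, String userQuery) {
    BooleanQuery combined = new BooleanQuery();
    StringTokenizer words = new StringTokenizer(userQuery.toLowerCase());
    while (words.hasMoreTokens()) {
      String word = words.nextToken();
      if (word.length() < 2) {
        continue; // too short to form a bi-gram; skipped in this sketch
      }
      PhraseQuery wordPhrase = new PhraseQuery();
      for (int i = 0; i + 2 <= word.length(); i++) {
        wordPhrase.add(new Term(field, word.substring(i, i + 2)));
      }
      combined.add(wordPhrase, BooleanClause.Occur.MUST);
    }
    return combined;
  }
}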

Re: Searching across spaces

Robert Young-5
In reply to this post by Eric Isakson
That sounds like just what I'm looking for. Do you know if this is
covered in Lucene in Action, or where I can find more information about it?



RE: Searching across spaces

Eric Isakson
In reply to this post by Robert Young-5
I think you will have to write a custom analyzer and tokenizer to produce the tokens you need, and you will have to arrange for whatever code builds your query to use that analyzer in the right circumstances. I don't think I've seen anyone post about this particular use case before, so I'd be surprised if there is much other information about it on the lists.

I haven't read Lucene in Action, so I don't know whether it is covered there. If Erik has any material on indexing Asian languages, there may be some background on using overlapping n-grams. Searching the list archives may also turn up some background if Lucene in Action doesn't have enough on this topic.
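
A rough sketch of such a custom tokenizer and analyzer, written against the old next()/Token API in use when this thread was posted (modern Lucene replaces it with attribute-based token streams); the class names are invented for illustration, and the token offsets refer to the whitespace-stripped text, which is a simplification:

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

// Reads the whole field, drops the whitespace, and emits overlapping
// character bi-grams as tokens.
public class WhitespaceStrippingBigramTokenizer extends Tokenizer {
  private String squashed;
  private int pos = 0;

  public WhitespaceStrippingBigramTokenizer(Reader reader) {
    super(reader);
  }

  public Token next() throws IOException {
    if (squashed == null) {
      StringBuffer sb = new StringBuffer();
      int c;
      while ((c = input.read()) != -1) {
        if (!Character.isWhitespace((char) c)) {
          sb.append(Character.toLowerCase((char) c));
        }
      }
      squashed = sb.toString();
    }
    if (pos + 2 > squashed.length()) {
      return null; // no more bi-grams
    }
    Token t = new Token(squashed.substring(pos, pos + 2), pos, pos + 2);
    pos++;
    return t;
  }
}

// Use the same analyzer at index time and query time so the bi-grams line up.
class BigramAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new WhitespaceStrippingBigramTokenizer(reader);
  }
}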



Re: Searching across spaces

Otis Gospodnetic-2
In reply to this post by Robert Young-5
Rob, look at the third hit:
  http://www.lucenebook.com/search?query=bi-grams

Otis
