combine wildcard and phrase query

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

combine wildcard and phrase query

JensBurkhardt
hey everybody,

I'm wondering if it's possible to combine wildcards and phrase query.

For example "term1 term*"

I know that the documentation says "Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries)" but maybe someone has had the same problem and found a solution.

Thanks for your help

Jens Burkhardt
Reply | Threaded
Open this post in threaded view
|

Re: combine wildcard and phrase query

JensBurkhardt
okay, another problem occured. I have different fields with the same name. I can't seperate them like naming them field1 field2 etc. cause while indexing i don't know how many fields i will need.
Like a book has several signature numbers i want to save them in a field signature and when i search for such a number i want the search hit every single field and not all fields together.
Right now i separate the string using an unique separator (in this case just $$$) so i can split the string into the numbers but i think this is kinda the worst form doing it.



JensBurkhardt wrote
hey everybody,

I'm wondering if it's possible to combine wildcards and phrase query.

For example "term1 term*"

I know that the documentation says "Lucene supports single and multiple character wildcard searches within single terms (not within phrase queries)" but maybe someone has had the same problem and found a solution.

Thanks for your help

Jens Burkhardt
Reply | Threaded
Open this post in threaded view
|

Re: combine wildcard and phrase query

Erick Erickson
No, as far as I know you can't combine wildcards in phrases. This would
get extraordinarily ugly extraordinarily quickly. The way Lucene handles
wildcards (conceputally) is to expand all the possible terms into a large OR
clause. Say my index contains term1, term2, and term3. The search for term*
really expands into term1 OR term2 OR term3. Now imagine the
complexity of a phrase like "dog* cat* hors*". Now say your index contained
10 terms starting with dog, 10 with cat and 10 with hors. You'd have 1,000
ORed phrase queries. And this is a tiny example....

You can try various approximations, and depending upon your index size they
may or may not work. For instance, you could index all the successive
shorter
forms. with increments of 0 (see synonym analyzer)  I.e. index horse, hors$
hor$
ho$ h$ all in the same position. Then searching for hor* becomes searching
for
hor$ and it all "just works". Of course this makes your index bigger.....

About your second issue: I'm not clear what your trying to accomplish. It's
no
problem to add the same field multiple times for a document. That is, you
can
doc.add(new field("field1", ......)
doc.add(new field("field1", ......)
doc.add(new field("field1", ......)
doc.add(new field("field1", ......)
as many times as you want before you add the document to the index. For
retrieval you can call getFields ("field1") and get an array of Fields back,
one
for each call to add above. You can also set the PositionIncrementGap while
indexing to separate the termposition of the first term of successive add()
calls
by, say, 100 (or whatever) if you need to worry about SpanNear or some such.

This may be waaaay off base. If so, could you give a concrete example of
what
your inputs are and how you want to search them?

Best
Erick

On Thu, Mar 6, 2008 at 7:28 AM, JensBurkhardt <[hidden email]> wrote:

>
> okay, another problem occured. I have different fields with the same name.
> I
> can't seperate them like naming them field1 field2 etc. cause while
> indexing
> i don't know how many fields i will need.
> Like a book has several signature numbers i want to save them in a field
> signature and when i search for such a number i want the search hit every
> single field and not all fields together.
> Right now i separate the string using an unique separator (in this case
> just
> $$$) so i can split the string into the numbers but i think this is kinda
> the worst form doing it.
>
>
>
>
> JensBurkhardt wrote:
> >
> > hey everybody,
> >
> > I'm wondering if it's possible to combine wildcards and phrase query.
> >
> > For example "term1 term*"
> >
> > I know that the documentation says "Lucene supports single and multiple
> > character wildcard searches within single terms (not within phrase
> > queries)" but maybe someone has had the same problem and found a
> solution.
> >
> > Thanks for your help
> >
> > Jens Burkhardt
> >
>
> --
> View this message in context:
> http://www.nabble.com/combine-wildcard-and-phrase-query-tp15870647p15872169.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: combine wildcard and phrase query

JensBurkhardt
okay thanks. the first thing was what i've expected :-) . well about my second issue,
i was totally wrong. Just forget what i've said! I had in mind that if i have several fields with the same name
these fields are connected to a big string.
Now as i read your message i remember that this behavior is just with boost-values ;-) and not with field-values. hell no ...
thanks for your answers :-). You saved a lot of time ;-)

best regards
Jens

Erick Erickson wrote
No, as far as I know you can't combine wildcards in phrases. This would
get extraordinarily ugly extraordinarily quickly. The way Lucene handles
wildcards (conceputally) is to expand all the possible terms into a large OR
clause. Say my index contains term1, term2, and term3. The search for term*
really expands into term1 OR term2 OR term3. Now imagine the
complexity of a phrase like "dog* cat* hors*". Now say your index contained
10 terms starting with dog, 10 with cat and 10 with hors. You'd have 1,000
ORed phrase queries. And this is a tiny example....

You can try various approximations, and depending upon your index size they
may or may not work. For instance, you could index all the successive
shorter
forms. with increments of 0 (see synonym analyzer)  I.e. index horse, hors$
hor$
ho$ h$ all in the same position. Then searching for hor* becomes searching
for
hor$ and it all "just works". Of course this makes your index bigger.....

About your second issue: I'm not clear what your trying to accomplish. It's
no
problem to add the same field multiple times for a document. That is, you
can
doc.add(new field("field1", ......)
doc.add(new field("field1", ......)
doc.add(new field("field1", ......)
doc.add(new field("field1", ......)
as many times as you want before you add the document to the index. For
retrieval you can call getFields ("field1") and get an array of Fields back,
one
for each call to add above. You can also set the PositionIncrementGap while
indexing to separate the termposition of the first term of successive add()
calls
by, say, 100 (or whatever) if you need to worry about SpanNear or some such.

This may be waaaay off base. If so, could you give a concrete example of
what
your inputs are and how you want to search them?

Best
Erick

On Thu, Mar 6, 2008 at 7:28 AM, JensBurkhardt <jensburkhardt@web.de> wrote:

>
> okay, another problem occured. I have different fields with the same name.
> I
> can't seperate them like naming them field1 field2 etc. cause while
> indexing
> i don't know how many fields i will need.
> Like a book has several signature numbers i want to save them in a field
> signature and when i search for such a number i want the search hit every
> single field and not all fields together.
> Right now i separate the string using an unique separator (in this case
> just
> $$$) so i can split the string into the numbers but i think this is kinda
> the worst form doing it.
>
>
>
>
> JensBurkhardt wrote:
> >
> > hey everybody,
> >
> > I'm wondering if it's possible to combine wildcards and phrase query.
> >
> > For example "term1 term*"
> >
> > I know that the documentation says "Lucene supports single and multiple
> > character wildcard searches within single terms (not within phrase
> > queries)" but maybe someone has had the same problem and found a
> solution.
> >
> > Thanks for your help
> >
> > Jens Burkhardt
> >
>
> --
> View this message in context:
> http://www.nabble.com/combine-wildcard-and-phrase-query-tp15870647p15872169.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
Reply | Threaded
Open this post in threaded view
|

Re: combine wildcard and phrase query

JensBurkhardt
In reply to this post by Erick Erickson
hi again,

referring to my second issue, i've got another question. I mean, this field thing works pretty well but:
My fields look like:
signature: LA A 100
signature: LA A 201
signature: LA A 202
signature: LA B 200
signature: LC B 300
Now i use getFields and search them.
Let's assume i'm searching for a signature like "LA B 200". If i use a phrase query, no problem. I search all the fields and the only if the field value and query exactly match, i get a hit.
But what if you want to use wildcards and search for something like LA A 20*. Now all the LA signatures will be in my results even if i just want two of them. The problems are the blanks but i have no idea how this could work.

Thanks and have a nice evening

Jens

Erick Erickson wrote
No, as far as I know you can't combine wildcards in phrases. This would
get extraordinarily ugly extraordinarily quickly. The way Lucene handles
wildcards (conceputally) is to expand all the possible terms into a large OR
clause. Say my index contains term1, term2, and term3. The search for term*
really expands into term1 OR term2 OR term3. Now imagine the
complexity of a phrase like "dog* cat* hors*". Now say your index contained
10 terms starting with dog, 10 with cat and 10 with hors. You'd have 1,000
ORed phrase queries. And this is a tiny example....

You can try various approximations, and depending upon your index size they
may or may not work. For instance, you could index all the successive
shorter
forms. with increments of 0 (see synonym analyzer)  I.e. index horse, hors$
hor$
ho$ h$ all in the same position. Then searching for hor* becomes searching
for
hor$ and it all "just works". Of course this makes your index bigger.....

About your second issue: I'm not clear what your trying to accomplish. It's
no
problem to add the same field multiple times for a document. That is, you
can
doc.add(new field("field1", ......)
doc.add(new field("field1", ......)
doc.add(new field("field1", ......)
doc.add(new field("field1", ......)
as many times as you want before you add the document to the index. For
retrieval you can call getFields ("field1") and get an array of Fields back,
one
for each call to add above. You can also set the PositionIncrementGap while
indexing to separate the termposition of the first term of successive add()
calls
by, say, 100 (or whatever) if you need to worry about SpanNear or some such.

This may be waaaay off base. If so, could you give a concrete example of
what
your inputs are and how you want to search them?

Best
Erick

On Thu, Mar 6, 2008 at 7:28 AM, JensBurkhardt <jensburkhardt@web.de> wrote:

>
> okay, another problem occured. I have different fields with the same name.
> I
> can't seperate them like naming them field1 field2 etc. cause while
> indexing
> i don't know how many fields i will need.
> Like a book has several signature numbers i want to save them in a field
> signature and when i search for such a number i want the search hit every
> single field and not all fields together.
> Right now i separate the string using an unique separator (in this case
> just
> $$$) so i can split the string into the numbers but i think this is kinda
> the worst form doing it.
>
>
>
>
> JensBurkhardt wrote:
> >
> > hey everybody,
> >
> > I'm wondering if it's possible to combine wildcards and phrase query.
> >
> > For example "term1 term*"
> >
> > I know that the documentation says "Lucene supports single and multiple
> > character wildcard searches within single terms (not within phrase
> > queries)" but maybe someone has had the same problem and found a
> solution.
> >
> > Thanks for your help
> >
> > Jens Burkhardt
> >
>
> --
> View this message in context:
> http://www.nabble.com/combine-wildcard-and-phrase-query-tp15870647p15872169.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
Reply | Threaded
Open this post in threaded view
|

Re: combine wildcard and phrase query

Erick Erickson
Have you considered indexing that field UN_TOKENIZED? Make sure
you build your queries that way too. I'm not at *all* clear about
how this works with wildcards, so you'll have to test that.

This assumes you never want to just be able to search on LA
and get a hit.

Best
Erick

On Fri, Mar 7, 2008 at 10:35 AM, JensBurkhardt <[hidden email]> wrote:

>
> hi again,
>
> referring to my second issue, i've got another question. I mean, this
> field
> thing works pretty well but:
> My fields look like:
> signature: LA A 100
> signature: LA A 201
> signature: LA A 202
> signature: LA B 200
> signature: LC B 300
> Now i use getFields and search them.
> Let's assume i'm searching for a signature like "LA B 200". If i use a
> phrase query, no problem. I search all the fields and the only if the
> field
> value and query exactly match, i get a hit.
> But what if you want to use wildcards and search for something like LA A
> 20*. Now all the LA signatures will be in my results even if i just want
> two
> of them. The problems are the blanks but i have no idea how this could
> work.
>
> Thanks and have a nice evening
>
> Jens
>
>
> Erick Erickson wrote:
> >
> > No, as far as I know you can't combine wildcards in phrases. This would
> > get extraordinarily ugly extraordinarily quickly. The way Lucene handles
> > wildcards (conceputally) is to expand all the possible terms into a
> large
> > OR
> > clause. Say my index contains term1, term2, and term3. The search for
> > term*
> > really expands into term1 OR term2 OR term3. Now imagine the
> > complexity of a phrase like "dog* cat* hors*". Now say your index
> > contained
> > 10 terms starting with dog, 10 with cat and 10 with hors. You'd have
> 1,000
> > ORed phrase queries. And this is a tiny example....
> >
> > You can try various approximations, and depending upon your index size
> > they
> > may or may not work. For instance, you could index all the successive
> > shorter
> > forms. with increments of 0 (see synonym analyzer)  I.e. index horse,
> > hors$
> > hor$
> > ho$ h$ all in the same position. Then searching for hor* becomes
> searching
> > for
> > hor$ and it all "just works". Of course this makes your index
> bigger.....
> >
> > About your second issue: I'm not clear what your trying to accomplish.
> > It's
> > no
> > problem to add the same field multiple times for a document. That is,
> you
> > can
> > doc.add(new field("field1", ......)
> > doc.add(new field("field1", ......)
> > doc.add(new field("field1", ......)
> > doc.add(new field("field1", ......)
> > as many times as you want before you add the document to the index. For
> > retrieval you can call getFields ("field1") and get an array of Fields
> > back,
> > one
> > for each call to add above. You can also set the PositionIncrementGap
> > while
> > indexing to separate the termposition of the first term of successive
> > add()
> > calls
> > by, say, 100 (or whatever) if you need to worry about SpanNear or some
> > such.
> >
> > This may be waaaay off base. If so, could you give a concrete example of
> > what
> > your inputs are and how you want to search them?
> >
> > Best
> > Erick
> >
> > On Thu, Mar 6, 2008 at 7:28 AM, JensBurkhardt <[hidden email]>
> > wrote:
> >
> >>
> >> okay, another problem occured. I have different fields with the same
> >> name.
> >> I
> >> can't seperate them like naming them field1 field2 etc. cause while
> >> indexing
> >> i don't know how many fields i will need.
> >> Like a book has several signature numbers i want to save them in a
> field
> >> signature and when i search for such a number i want the search hit
> every
> >> single field and not all fields together.
> >> Right now i separate the string using an unique separator (in this case
> >> just
> >> $$$) so i can split the string into the numbers but i think this is
> kinda
> >> the worst form doing it.
> >>
> >>
> >>
> >>
> >> JensBurkhardt wrote:
> >> >
> >> > hey everybody,
> >> >
> >> > I'm wondering if it's possible to combine wildcards and phrase query.
> >> >
> >> > For example "term1 term*"
> >> >
> >> > I know that the documentation says "Lucene supports single and
> multiple
> >> > character wildcard searches within single terms (not within phrase
> >> > queries)" but maybe someone has had the same problem and found a
> >> solution.
> >> >
> >> > Thanks for your help
> >> >
> >> > Jens Burkhardt
> >> >
> >>
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/combine-wildcard-and-phrase-query-tp15870647p15872169.html
> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [hidden email]
> >> For additional commands, e-mail: [hidden email]
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/combine-wildcard-and-phrase-query-tp15870647p15896083.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: combine wildcard and phrase query

hossman
In reply to this post by Erick Erickson

: No, as far as I know you can't combine wildcards in phrases. This would

The QueryParser doesn't support it, and there is no native query type
for it, but if you are willing to do the query expansion yourself, you can
build a MultiPhraseQuery (where you generate the terms using a
WildcardTermEnum)  

alternately, you could use something like SpanRegexQuery in combination
with a SpanNearQuery.

...lots of things are possible with Lucene Queries that just aren't
possible with the QueryParser.




-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]