Lucy::Search::RegexQuery ????

Nick D.
I was wondering if there is a way to query a Lucy index using regular expressions.

For example: The command `grep -i -P '65\d\s+Security' | grep -v -i -P '(?:654|656|650|652)\s+Security'` will search for "65" followed by 1 digit, followed by one or more whitespace characters, followed by "Security", excluding "654", "656", "650" and "652". So potential results are something like this:

"stuff here 651 Security and more stuff"
"stuff here 653 Security and more stuff"
"stuff here 655                        Security and more stuff"

but it will not return any of the below:

"stuff here 651 not followed by Security and more stuff"
"stuff here 653 not followed by Security and more stuff"

Another example is searching for an IP address with `(?:\d{1,3}\.){3}\d{1,3}`

Is there any way to accomplish this with the existing API?
Are there any plans to support this?
If it is not fully supported, what is supported?
If it is not supported at all, what approach should I take to create something like this? (For example, something that converts a regex into a bunch of ORQueries?)

Re: [lucy-user] Lucy::Search::RegexQuery ????

Peter Karman
On 12/18/13 3:57 PM, Nick D. wrote:

> I was wondering if there is a way to query a Lucy index using regular
> expressions.
>
> For example: The command `grep -i -P '65\d\s+Security' | grep -v -i -P
> '(?:654|656|650|652)\s+Security'` will search for "65" followed by 1 digit
> followed by any number of spaces followed by "Security" ignoring "654" "656"
> "650" "652". So potential results are something like this:
>
> "stuff here 651 Security and more stuff"
> "stuff here 653 Security and more stuff"
> "stuff here 655                        Security and more stuff"
>
> but it will not return any of the below:
>
> "stuff here 651 not followed by Security and more stuff"
> "stuff here 653 not followed by Security and more stuff"
>
> Another example is searching for an ip with `(?:\d{1,3}\.){3}\d{1,3}`
>
> Is there anyway to accomplish this with the existing api?
> are there any plans to support this?
> If not fully supported what is supported?
> If not supported at all what approach should I take to create something like
> this? (create something that converts regex to a bunch of ORQueries etc?)
>


Hi Nick,

There is no RegexQuery class in core as you describe it.

The closest thing on CPAN is LucyX::Search::WildcardQuery, which was
inspired by the PrefixQuery example in the Lucy docs, among other things.

There have been IRC discussions in years(!) past about porting the pure
Perl regex code in WildcardQuery to C and making it part of core, but
nobody has had the tuits for that.

The one qualifier to your examples vs WildcardQuery is that your
examples assume un-tokenized field values (e.g. Lucy::Plan::StringType),
which means you'd have to think carefully about how to plan out your
index schema to accommodate a regex against a phrase as well as a single
term. The WildcardQuery algorithm is to open each internal Lexicon and
examine each term in it for matches against a regex.
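
The term scan described above can be sketched in plain Perl. This is only an illustration, not WildcardQuery's actual code: the term list stands in for a field's Lexicon, the `matching_terms` helper is hypothetical, and anchoring the regex to the whole term is an assumption made here for clarity.

```perl
use strict;
use warnings;

# Stand-in for the terms of one field's Lexicon (hypothetical data).
my @lexicon_terms = qw( 650 651 652 653 security securities insecure );

# Return the terms that match the regex. The pattern is anchored so it
# must match the whole term, not just part of it.
sub matching_terms {
    my ( $regex, @terms ) = @_;
    return grep { /\A(?:$regex)\z/ } @terms;
}

my @matches = matching_terms( qr/65[13]/, @lexicon_terms );

# In Lucy each match would become one child query of an ORQuery;
# here we only render the equivalent boolean expression as a string.
my $or_expression = join ' OR ', map {"body:$_"} @matches;
print "$or_expression\n";    # body:651 OR body:653
```

Because the scan works term by term, a pattern that spans several tokens (such as `65\d\s+Security`) can only match when the field is indexed un-tokenized.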

Internally, the WildcardQuery class creates an ORQuery using all the
terms in the Lexicon that match the query terms, so yes, that is one way
to approach this. If you're looking for examples of creating your own
query classes, you might look at prior art in
LucyX::Search::NullTermQuery as well as WildcardQuery, both on CPAN. I
also started a project here:

https://github.com/karpet/lucyx-search-delegatequery

to make this kind of thing easier, but haven't returned to it yet to
make sure it is CPAN-ready.

All that said, having created all those Query extensions myself, I
recommend avoiding that approach if you can. Pure Perl Query extensions
are much slower than the native C classes, and they can be awkward to
develop/debug because of the unholy trinity of Query/Compiler/Matcher
(much discussion about that in the lucy-dev archives).

I personally would look at a combination of
LucyX::Search::ProximityQuery and query expansion instead, using
Search::Query::Dialect::Lucy and Search::Query::Parser. That way you can
leverage the performance of the native Lucy query classes and still get
the flexibility you need for matching patterns.

Example (NOT TESTED):

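use Search::Query;
use Search::Query::Parser;

# setup relevant field schema
# get_lucy_searcher() is a placeholder for however you obtain your searcher
my $searcher  = get_lucy_searcher();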
my $schema    = $searcher->get_schema();
my @fieldnames = qw(
     ipaddr
     body
);
my %fields = ();

for my $f (@fieldnames) {
     $fields{$f} = {
         type     => $schema->fetch_type($f),
         analyzer => $schema->fetch_analyzer($f),
     };
}

# create query parser
my $qp = Search::Query::Parser->new(
     dialect          => 'Lucy',
     fields           => \%fields,
     croak_on_error   => 0,          # strict mode off
     sloppy           => 1,          # forgiving parser
     fixup            => 1,          # even more forgiveness
     null_term        => 'NULL',
     query_class_opts => {
         default_field => [
             qw( body )
         ],
     },
     term_expander => sub {
         my ( $term, $field ) = @_;
         return ($term) if ref $term;    # skip ranges
         if ( $field eq 'body' ) {

            # mangle a regex into an actual query
            # e.g.
            # '(?:654|656|650|652)\s+Security'
            # the array returned gets OR'd together
            return (
                qq/"654 security"/,
                qq/"656 security"/,
                qq/"650 security"/,
                qq/"652 security"/,
            );
         }
         return ($term);
     },
);

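# run it
# q// rather than qq// so the backslash in \s reaches the parser intact
my $query = $qp->parse( q/body:'(?:654|656|650|652)\s+Security'/ );
my $lucy_query = $query->as_lucy_query();
# use the $searcher obtained at the top of the example
my $hits = $searcher->hits( query => $lucy_query );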
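
The "mangle a regex into an actual query" step is left hand-written in the example above. One rough way to generate such a phrase list is shown below, applied to the original `65\d` case with its four exclusions. The `expand_to_phrases` helper is hypothetical, it only handles three-digit numbers, and it assumes the field's analyzer lowercases "Security" to "security".

```perl
use strict;
use warnings;

# Hypothetical helper: enumerate every three-digit string, keep those that
# match the include pattern and not the exclude pattern, and build one
# quoted phrase per surviving number.
sub expand_to_phrases {
    my ( $include_re, $exclude_re, $word ) = @_;
    my @phrases;
    for my $n ( 0 .. 999 ) {
        my $candidate = sprintf '%03d', $n;
        next unless $candidate =~ $include_re;
        next if $candidate =~ $exclude_re;
        push @phrases, qq/"$candidate $word"/;
    }
    return @phrases;
}

my @phrases = expand_to_phrases(
    qr/\A65\d\z/,                     # 65 followed by one digit
    qr/\A(?:654|656|650|652)\z/,      # numbers to leave out
    'security',
);
print "$_\n" for @phrases;
```

A list like this is what a term_expander callback could return, since the returned terms are OR'd together. Brute-force enumeration is only practical for small, bounded patterns; an open-ended regex has no finite expansion.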


--
Peter Karman  .  http://peknet.com/  .  [hidden email]

Re: [lucy-user] Lucy::Search::RegexQuery ????

Nick D.
Thanks Peter.

Can you give me an example of how ProximityQuery works? There isn't much documentation on it. For example, what is "within" used for? Does it take the first term in the list and require the second term to be "within" so many positions of the first? Or does each term have to be "within" at most "x" positions of another term?

Re: [lucy-user] Lucy::Search::RegexQuery ????

Peter Karman
On 12/31/13 10:58 AM, Nick D. wrote:
> Thanks Peter.
>
> Can you give me an example of how ProximityQuery works. There isn't much
> documentation on it. For example what is the "within" used for. Does it use
> the first term in the list and say that the second term must be "within" so
> many places of the first term? Does each term have to be "within" at max "x"
> amount of places from another term?
>

Your guesses as to how it works are correct.

This might help:

http://mail-archives.apache.org/mod_mbox/lucy-user/201206.mbox/%3C4FE54DF0.8060400@...%3E
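
For readers without access to that archive link, a minimal construction sketch follows. It assumes the constructor takes `field`, `terms` and `within`, as in the module's synopsis; the field name and terms are illustrative, and how positions are counted should be confirmed against the module itself.

```perl
use strict;
use warnings;

# Constructor arguments, assumed to follow the module's synopsis.
# The field name 'body' and the two terms are illustrative only.
my %args = (
    field  => 'body',
    terms  => [qw( 651 security )],
    within => 10,    # maximum distance between the terms, in token positions
);

# Build the query only when the module is installed, so the sketch
# still runs on a machine without Lucy.
my $query;
if ( eval { require LucyX::Search::ProximityQuery; 1 } ) {
    $query = LucyX::Search::ProximityQuery->new(%args);
}
print( defined($query)
    ? "built a ProximityQuery\n"
    : "LucyX::Search::ProximityQuery is not installed\n" );
```

The resulting object can be passed to the searcher's `hits( query => ... )` call like any other Lucy query.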




--
Peter Karman  .  http://peknet.com/  .  [hidden email]