Sorting based on a selling rate

John Pailet
I want to implement a specific kind of search based on a selling rate.

Let me explain:

I have a book collection in my store.
I index my books like this:

- One Lucene Document per book
- Two Lucene Fields in the document
        - TITLE OF THE BOOK
        - KEYWORDS OF THE BOOK

The keyword field is a BOOSTED field (* 1000)

This is working fine :-)

Now, I would like to search and sort my books according to the selling rate of the book.

Example:

If the user searches for "Java", the first books that Lucene returns must be the best-selling books for this specific search: "Java"

If the user searches for "Java and .net", the first books that Lucene returns must be the best-selling books for this specific search: "Java and .net"

For each query, I have a program that records the product selling rate for that specific query.

For example:

Query: "Java" -> [ProductId:123, rate: 23%], [ProductId:222, rate: 15%], [ProductId:567, rate: 7%]...

Query: "Java and .net" -> [ProductId:99, rate: 45%], [ProductId:194, rate: 30%], [ProductId:93, rate: 10%]...

How can I return books based on the selling rate (for a specific query)?
Must I develop a specific handler that runs after the basic Lucene search and sorts the results programmatically, or is it possible to implement a mechanism at index or search time?

Thank you for your help and sorry for my bad English ;-)

John

RE: Sorting based on a selling rate

Dejan Nenov-2
(excuse the semi-appropriate forum to make this comment in - but it is very
brief and may actually help improve the final Lucene-based app)

You may also like to import popularity data from Amazon using their open
APIs and mix the relevancy between your own popularity score and theirs.

Dejan (affiliated with safaribooksonline)

-----Original Message-----
From: John Pailet [mailto:[hidden email]]
Sent: Monday, August 28, 2006 1:10 AM
To: [hidden email]
Subject: Sorting based on a selling rate


Re: Sorting based on a selling rate

Chris Hostetter-3
In reply to this post by John Pailet

Sorting on an integer field can be done using any of the Searcher.search
methods which take a "Sort" object.


-Hoss


Re: Sorting based on a selling rate

John Pailet
Hello,

OK for the Sort object, but my problem is that I don't know how to retrieve (or store) the sell rate of the products. The sell rate depends on the QUERY! The sort order is different for each query.

I imagine connecting to the DB and getting the sell rate of products for this specific query... but connecting to the DB on every query is not the right choice ;-)

Any ideas?

Re: Sorting based on a selling rate

Chris Hostetter-3

: OK for the Sort object, but my problem is that I don't know how to retrieve (or
: store) the sell rate of the products. The sell rate depends on the QUERY!
: The sort order is different for each query.
:
: I imagine connecting to the DB and getting the sell rate of products for this
: specific query... but connecting to the DB on every query is not the right
: choice ;-)

I'm sorry... I misunderstood your question. Rereading it now, here
is what it sounds like you are saying:

When a user gives you a search term they are interested in,
you have an external system that you use to look up the top N selling
books matching those terms, and what their sell rate is.  You would like
to do a search across your entire Lucene index, having those N products
score higher based on their sell rate (which is not in the index).

...assuming I have that right, try adding an optional clause to your
query for each of your N productIds, with a boost that is proportional
to the sell rate.  The exact boost values should be based on how important
you want the sell rate to be compared with the textual relevancy of
the query term.
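
A minimal sketch of the optional-clause idea, written as a query string in Lucene's classic query-parser syntax rather than against any particular Lucene API. It assumes each book document also carries an indexed, untokenized field holding its product ID; the field name "productId" and the "scale" tuning factor are illustrative assumptions, not something from the index described above.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class SellRateBoost {

    /**
     * Builds a query string: the user's terms are required (+), and each
     * known best seller is appended as an optional clause whose boost (^)
     * is proportional to its sell rate for this query.
     *
     * "productId" is an assumed field name; "scale" is a hypothetical knob
     * for how strongly the sell rate should outweigh textual relevancy.
     */
    public static String buildQuery(String userQuery, Map<String, Double> sellRates, double scale) {
        StringBuilder sb = new StringBuilder();
        sb.append("+(").append(userQuery).append(")");
        for (Map.Entry<String, Double> e : sellRates.entrySet()) {
            double boost = e.getValue() * scale;
            // Locale.ROOT keeps the decimal separator a dot on every machine.
            sb.append(String.format(Locale.ROOT, " productId:%s^%.1f", e.getKey(), boost));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sell rates recorded by the external system for the query "Java".
        Map<String, Double> rates = new LinkedHashMap<>();
        rates.put("123", 23.0);
        rates.put("222", 15.0);
        rates.put("567", 7.0);
        System.out.println(buildQuery("java", rates, 1.0));
    }
}
```

Because the productId clauses are optional, books without a recorded sell rate still match on the text; they simply receive no extra score. The right value for the scale factor has to be found by experiment.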



-Hoss


Re: Sorting based on a selling rate

John Pailet
Yes, that is exactly what I want to do!

My external system gives me the sell rate / top N selling books matching the user's terms (query).

I don't know which is the best way:

Storing the sell rate in Lucene Fields of the documents (multiple combinations) and sorting by this field,

or doing something like you said: "adding an optional clause to your
query for each of your N productIds"... but I don't really know how to do this...

Any sample code?

Thank you very much,

John

segments' size and getMaxMergeDocs()

Stanislav Jordanov
If IndexWriter.getMaxMergeDocs() always returns M
then which one is true:
1) No segment file will ever contain > M documents;
2) Any segment that participates in a merge contains <= M documents (but
the resulting segment of the merge may contain > M documents)

Obviously (1) implies (2) but my guess (based on my practical
experience) is that only (2) is true.


Stanislav

Re: Sorting based on a selling rate

Chris Hostetter-3
In reply to this post by John Pailet

: I don't know what is the best way:

That depends on your needs... if the selling rate changes very infrequently,
or if you are dealing with the sell rate for lots of books per
query, then I'd put it in your index... if it's constantly in flux and you
only care about the sell rate of one or two books for each query, doing it
at query time is fine.

The definitions of "infrequently", "lots", and "constantly" are all
specific to the scope of your problem, and not something I can give
general advice on.
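
For the query-time route, one simple possibility is to reorder the hits in application code after the search has run. The sketch below is plain Java and uses no Lucene classes; the class and method names are hypothetical, and it assumes you have already pulled the product IDs of the top hits out of the search results in relevance order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SellRateRerank {

    /**
     * Reorders product IDs (given in relevance order) so that products with
     * a higher sell rate for this query come first. Products with no
     * recorded rate count as 0.0 and keep their relative relevance order,
     * because List.sort is a stable sort.
     */
    public static List<String> rerank(List<String> hitsByRelevance, Map<String, Double> sellRates) {
        List<String> result = new ArrayList<>(hitsByRelevance);
        result.sort(Comparator.comparingDouble(
                (String id) -> sellRates.getOrDefault(id, 0.0)).reversed());
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> rates = new HashMap<>();
        rates.put("123", 23.0);
        rates.put("222", 15.0);
        rates.put("567", 7.0);
        List<String> hits = Arrays.asList("10", "567", "123", "20", "222");
        System.out.println(rerank(hits, rates)); // [123, 222, 567, 10, 20]
    }
}
```

This only reorders the hits that were actually fetched, so a best seller that ranks very low on text relevancy may never reach the list being sorted.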

: or doing something like you said: "adding an optional clause to your
: query for each of your N productIds"... but I don't really know how to do
: this...
:
: Any sample code?

I suggest you start by looking at the javadocs for BooleanQuery.



-Hoss


SpanRegex speed

Mark Miller-3
 * An implementation tying Java's built-in java.util.regex to RegexQuery.
 *
 * Note that because this implementation currently only returns null from
 * {@link #prefix} that queries using this implementation will enumerate and
 * attempt to {@link #match} each term for the specified field in the index.

Is this another way to say I'm gonna be friggin' slow? Say it ain't so...

I want to use this as a multi-phrase query... a SpanNearQuery with a term that
could be the regex "term1|term2"

I need this. Pipe dream for speed on a huge index?

- Mark

Re: SpanRegex speed

Erik Hatcher

On Aug 30, 2006, at 6:13 PM, Mark Miller wrote:

> * An implementation tying Java's built-in java.util.regex to RegexQuery.
> *
> * Note that because this implementation currently only returns null from
> * {@link #prefix} that queries using this implementation will enumerate and
> * attempt to {@link #match} each term for the specified field in the index.
>
> Is this another way to say im gonna be friggen slow? Say it aint so...

"slow" is relative.  It will enumerate all the terms for the  
specified field and run a regular expression match on each one.  The  
same thing happens with FuzzyQuery and prefixed WildcardQuery too.  
These aren't necessarily "slow", so try it and see.

> I want to use this as a multi-phrase query...a spannear with a term  
> that could be the regex "term1|term2"

What about nesting a SpanOrQuery for those two terms inside a  
SpanNearQuery?

> I need this. Pipe dream for speed on a huge index?

Feel free to implement a robust prefix method :)  It's much more  
difficult than I wanted to tackle when I created this  
infrastructure.  But thankfully Regexp implemented it, so you could  
use it for prefix computation and a different matcher implementation  
if you like.

        Erik



Re: SpanRegex speed

Mark Miller-3
Thanks for the info Erik. I did not realize that WildcardQuery and
FuzzyQuery did this as well. A lot of my concern was that I needed to
implement WildcardQuery as a SpanRegexQuery so that I could get nested
wildcard searches in my proximity searches. If it's the same speed as
WildcardQuery I am not worried. However, it seems like it could be even
faster:

I only need to support * and ? as wildcard does. I don't want to include
Jakarta regex with my distro. I made a new Regex implementation based on
the Java 5 util stuff that only allows * and ?.

I pass the pattern string into a short method that:
     * Removes single backslashes, halves double backslashes, escapes
     * non-alphanumeric, and records prefix. Ignores * and ?.

Then I replace * with .* and ? with .{1}.

Only supporting * and ? seems to make grabbing the prefix nice and simple.
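
A rough sketch of that kind of conversion using only java.util.regex. It is not the code described above: the class and method names are hypothetical, and backslash escaping is deliberately left out, so every character other than * and ? is treated as a literal.

```java
import java.util.regex.Pattern;

public class WildcardToRegex {

    /** Returns the literal characters before the first * or ?. */
    public static String prefix(String wildcard) {
        int i = 0;
        while (i < wildcard.length()
                && wildcard.charAt(i) != '*'
                && wildcard.charAt(i) != '?') {
            i++;
        }
        return wildcard.substring(0, i);
    }

    /** Translates * to ".*" and ? to "." and quotes every other run of characters. */
    public static String toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                if (literal.length() > 0) {
                    // Pattern.quote keeps regex metacharacters such as '.' literal.
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        String wildcard = "te?t*";
        System.out.println(prefix(wildcard));                               // te
        System.out.println(Pattern.matches(toRegex(wildcard), "testing")); // true
        System.out.println(Pattern.matches(toRegex(wildcard), "tet"));     // false
    }
}
```

The prefix is what lets term enumeration start at the first term beginning with "te" instead of scanning every term in the field.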

Now my question: should I use this instead of WildcardQuery even when
not in a span search? Sounds like it would be more efficient.

Also, how does a SpanOrQuery work? Is the resulting span anchored at the
start of the word and the length of the word, like a term span, so that
it's an "or" of term spans? If there is more than one match, does the span
cover all of them, or is each match its own span the size of each hit?

Thanks,

Mark

Re: SpanRegex speed

Mark Miller-3
In reply to this post by Erik Hatcher
Ignore that last question. I see that you said prefix wildcard query and
not wildcard query. A quick look at the code seems to show it grabbing a
prefix as well.

Do you think one would be any faster than the other? Should I use
WildcardQuery outside of span queries and the regex query inside
span queries, or use regex in both places?

- Mark

Re: SpanRegex speed

Erick Erickson
Let me chime in here on a different note.... before you get happy with
wildcard queries, take a look at the thread "I just don't get wildcards at
all". There is lots of good info that Erik, Chris and Otis provided me.

The danger with PrefixQuery and WildcardQuery is that they will throw
TooManyClauses exceptions once the pattern matches too many terms (the
default limit is 1024, although you can make it much bigger if memory allows).
If you're aware of this and it is and will be OK in your app, ignore this.
But if your index is going to grow significantly, this is a real problem. I
went with implementing filters with WildcardTermEnum (you could also use
RegexTermEnum) for the wildcard portions of my query. That has interesting
implications for spans; we elected to say spans didn't work with wildcards.
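
To see why a growing index runs into this, here is a small plain-Java simulation of expanding a prefix against a sorted term dictionary. It is not Lucene code: the class name is made up, and a TreeSet stands in for the index's term list. Only the limit of 1024 comes from the paragraph above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixExpansion {

    /**
     * Collects every term that starts with the prefix, failing once more
     * than maxClauses terms match. This mimics the situation where each
     * matching term becomes one clause of a large OR query.
     */
    public static List<String> expand(SortedSet<String> terms, String prefix, int maxClauses) {
        List<String> matches = new ArrayList<>();
        // Terms sharing a prefix are contiguous in sorted order, starting at the prefix itself.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (matches.size() == maxClauses) {
                throw new IllegalStateException(
                        "prefix '" + prefix + "' matches more than " + maxClauses + " terms");
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            terms.add("book" + i);
        }
        terms.add("java");
        System.out.println(expand(terms, "java", 1024).size()); // 1
        try {
            expand(terms, "book", 1024);
        } catch (IllegalStateException e) {
            System.out.println("expansion failed: " + e.getMessage());
        }
    }
}
```

The same pattern that works on a small index starts failing as soon as the number of matching terms crosses the limit, which is why a filter-based approach scales better.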

Anyway, as I said, if you're aware of the TooManyClauses issue and are sure
it doesn't matter, ignore me. After all, everybody else does <G>.....


Best
Erick



graphically representing an index

SOMMERIA KLEIN Ariel Ext       VIACCESS-BU_DRM
Hi all,
I'm a newbie with Lucene and I'm looking to implement the following:
I want to index posts from a forum, and, rather than proposing a search
on the contents, graphically represent the contents of the index. More
precisely, I would like to have a list of the most popular words, with a
number next to each indicating how often they occur.
The icing on the cake would be to be able to click on such a word and
get a subset of the posts including that word.
Can Lucene be used for this? Has anyone already implemented it? Any
links?
I've dug around a bit without any success, but my apologies if this has
already been dealt with

Re: graphically representing an index

Andrzej Białecki-2
See http://www.getopt.org/luke for an example of such functionality.
However, I must disappoint you - the most frequent words in a corpus are
quite probably also the most useless words. For English these are: the, a,
to, for, by, in, can, I, ...
So, you will need to eliminate them from the top of the list to get any
useful results.
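
As a rough illustration of the idea in plain Java, without Lucene or Luke: the sketch below builds a word-to-posts map, skips a small stop-word list, and ranks words by the number of posts that contain them. The class name, the stop-word list, and the tokenizing rule are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class PopularWords {

    // Illustrative stop words only; a real list would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "a", "to", "for", "by", "in", "can", "i"));

    /** Maps each non-stop word to the IDs (list positions) of the posts containing it. */
    public static Map<String, Set<Integer>> index(List<String> posts) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int id = 0; id < posts.size(); id++) {
            String text = posts.get(id).toLowerCase(Locale.ROOT);
            for (String word : text.split("[^a-z0-9]+")) {
                if (word.isEmpty() || STOP_WORDS.contains(word)) {
                    continue;
                }
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
            }
        }
        return postings;
    }

    /** Returns the n words found in the most posts; ties are broken alphabetically. */
    public static List<String> topWords(Map<String, Set<Integer>> postings, int n) {
        List<String> words = new ArrayList<>(postings.keySet());
        words.sort(Comparator.comparingInt((String w) -> postings.get(w).size())
                .reversed()
                .thenComparing(Comparator.naturalOrder()));
        return new ArrayList<>(words.subList(0, Math.min(n, words.size())));
    }

    public static void main(String[] args) {
        List<String> posts = Arrays.asList(
                "The index is in Lucene",
                "Lucene can index the forum",
                "A forum for Lucene");
        Map<String, Set<Integer>> postings = index(posts);
        for (String word : topWords(postings, 3)) {
            System.out.println(word + " " + postings.get(word).size() + " " + postings.get(word));
        }
    }
}
```

The set of post IDs stored for each word is what a "click on the word" feature would use to list the matching posts. Note that the count here is the number of posts containing the word, not the total number of occurrences.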

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: graphically representing an index

Erick Erickson
In reply to this post by SOMMERIA KLEIN Ariel Ext VIACCESS-BU_DRM
Take a look at Luke (http://www.getopt.org/luke/). I think this does a lot
of what you're asking for. It's opensource, so you could see how it's done.
There are screenshots at the link above so you can see if it's actually what
you want.....

You might also want to look at the Term* classes in the API, particularly
TermDocs, TermEnum, TermFreqVector, TermPositionVector and TermPositions.

I'm quite sure all the information is there, it'll probably be interesting
to put it all together efficiently <G>

Hope this helps
Erick

Re: Sorting based on a selling rate

John Pailet
In reply to this post by Chris Hostetter-3
Putting the selling rate in the index is OK for me; I also think that is a good idea.

The problem is: I don't know how to store a product's sell rate when it depends on a specific query.

Can you please give me your idea of how to store it in the Lucene document? (field/value)

Thank you very much,

John

 



Re: SpanRegex speed

Mark Miller-3
In reply to this post by Erick Erickson
Thanks a lot for the info, Erick. Good stuff to know for sure.
I guess the real question I have been trying to spit out is this:
Is a span version of any of these searches--fuzzy, wildcard,
etc.--inherently slower than their non-span brothers? If they have the
same limitations and speeds then that is all I am looking for.

P.S.
I realize I have been screwing up the threading by replying when
starting a new topic. I have been alerted and will stop this pernicious
activity.

- Mark

RE: graphically representing an index

SOMMERIA KLEIN Ariel Ext       VIACCESS-BU_DRM
In reply to this post by Andrzej Białecki-2
Hi Andrzej,
Thanks for the tip, it does what I want. You are right, though, it's of limited use for helping the user access data. But I'm sure it will come in handy for my own analysis.
Best,
Ariel

-----Original Message-----
From: Andrzej Bialecki [mailto:[hidden email]]
Sent: Thursday, August 31, 2006 3:49 PM
To: [hidden email]
Subject: Re: graphically representing an index

Re: SpanRegex speed

Erick Erickson
In reply to this post by Mark Miller-3
OK, a not very helpful answer, but "of course they're slower, they do more
work" (the span versions). But that's fairly useless, since the question is
really "is it enough slower in my situation that I need to find an
alternative?". And the only way I know of to answer that question is to make
some tests with the data representing my particular problem......

Sorry I can't be more help....
Erick
