Need some Advice on Searching

Need some Advice on Searching

David Ahlschläger
Hi All.

Firstly, I am new to using Lucene and all its APIs.

I am trying to evaluate if Lucene can solve the following problem for me.

1. I need to temporarily index sets of documents on the fly, say 100 at a
time.
    This seems simple enough: I create an index either on the file system or
in memory (this I can do)
    with the following fields for each document: external_id (fixed-length
string: 255 chars), contents (contents of the HTML file)

2. I need to run a Fixed set of Queries against the Index I created on the
"contents" field.
    The Queries are in the form "123456789 OR 4323456 OR House OR Flat" or
more complicated
    like "((flat AND bed) OR (cat AND dog))"


My problem is that for these queries I need to know which Documents hit. I
also need to know which terms hit and if possible
the location of the hits for each term in the hit Document.

I can create queries using the Query Parser and get the documents that hit.
This, I assume, is what is referred to as the Hits API?

What I could really use is a brief description of the steps I would need to
perform to solve the above, a pointer in the right direction
so to speak.

Would I need to write my own Query Parser or Searcher, or dig really deep into the
bowels of Lucene, etc.?

Suggestions would be really appreciated.

Re: Need some Advice on Searching

Chris Hostetter-3

i assume when you say this...

: 1. I need to temporarily index sets of documents on the fly, say 100 at a
: time.

you mean that you'll have lots of temporary indexes of a few hundred
documents and then you'll do a bunch of queries and throw the index away.
Even if i'm wrong most of the rest of my advice will still be useful, but
it's good to clarify.

: My problem is that for these queries I need to know which Documents hit. I
: also need to know which terms hit and if possible
: the location of the hits for each term in the hit Document.

knowing which docs match your query is easy.  knowing where in a document a
particular term matches can be done using the TermPositions APIs ... but
it gives you that info as a position counted in "terms", which for HTML
content may be confusing depending on how your analyzer deals with that HTML.
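
To make the "position counted in terms" point concrete, here is a minimal plain-Java sketch. It does not use Lucene at all: the class `TermPositionsSketch`, its `occurrences` method, and the naive "runs of letters and digits" tokenizer are assumptions made purely for illustration, and a real analyzer may tokenize, lowercase, or skip markup quite differently. It only shows how a token position differs from a character offset in the raw text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only; a hypothetical helper, not part of Lucene. */
final class TermPositionsSketch {

    /** One occurrence of a term: its token position and its character offset. */
    static final class Occurrence {
        final int position;     // 0-based count of tokens seen before this one
        final int startOffset;  // 0-based index of the token's first char in the raw text

        Occurrence(int position, int startOffset) {
            this.position = position;
            this.startOffset = startOffset;
        }

        @Override
        public String toString() {
            return "position=" + position + ", offset=" + startOffset;
        }
    }

    /** Finds every occurrence of term, treating each run of letters/digits as one token. */
    static List<Occurrence> occurrences(String text, String term) {
        List<Occurrence> found = new ArrayList<>();
        Matcher tokens = Pattern.compile("[A-Za-z0-9]+").matcher(text);
        String wanted = term.toLowerCase(Locale.ROOT);
        int position = 0;
        while (tokens.find()) {
            if (tokens.group().toLowerCase(Locale.ROOT).equals(wanted)) {
                found.add(new Occurrence(position, tokens.start()));
            }
            position++; // every token advances the position, including tag names
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<p>A flat with a flat roof</p>";
        for (Occurrence o : occurrences(html, "flat")) {
            System.out.println(o);
        }
    }
}
```

For the input above, this naive tokenizer reports "flat" at positions 2 and 5, which are character offsets 5 and 17. The "p" of the opening tag is counted as token 0, which is the kind of surprise you get when the tokenizer is not HTML-aware.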

if you have complex boolean queries and you need to know which individual
part of the query matched, that's not really trivial.  you didn't mention
anything about "score" or "relevancy" in your email, so i'm guessing all
you care about is boolean "did it match or not" logic .. in that case
using Filters directly (without ever searching) is your friend.  You can
build a Filter for each individual clause, intersect/union the bitsets to
get the final set of matching documents for your whole query, and then
inspect the individual bitsets to know the specifics about which clauses
match which documents.

some people don't like Filters because of how much space they take up for
really large indexes, but if you've only got 100 docs ... there's no
reason not to use them


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Need some Advice on Searching

David Ahlschläger
On 19/05/06, Chris Hostetter <[hidden email]> wrote:

> i assume when you say this...
>
> : 1. I need to temporarily index sets of documents on the fly, say 100 at a
> : time.
>
> you mean that you'll have lots of temporary indexes of a few hundred
> documents and then you'll do a bunch of queries and throw the index away.
> Even if i'm wrong most of the rest of my advice will still be useful, but
> it's good to clarify.


Correct, I will throw them away!

> : My problem is that for these queries I need to know which Documents hit. I
> : also need to know which terms hit and if possible
> : the location of the hits for each term in the hit Document.
>
> knowing which docs match your query is easy.  knowing where in a document a
> particular term matches can be done using the TermPositions APIs ... but
> it gives you that info as a position counted in "terms", which for HTML
> content may be confusing depending on how your analyzer deals with that HTML.


Okay, based on your answer and a little testing just to see what it gives me,
I assume Lucene only stores the term offset (which is analyzer-dependent) and
not the actual offset of the term as retrieved from the plain-text stream.

> if you have complex boolean queries and you need to know which individual
> part of the query matched, that's not really trivial.  you didn't mention
> anything about "score" or "relevancy" in your email, so i'm guessing all
> you care about is boolean "did it match or not" logic .. in that case
> using Filters directly (without ever searching) is your friend.  You can
> build a Filter for each individual clause, intersect/union the bitsets to
> get the final set of matching documents for your whole query, and then
> inspect the individual bitsets to know the specifics about which clauses
> match which documents.


Score/relevance is not important. I need the yes/no logic together with the
info on what caused the match. Could you maybe explain how to intersect/union
the bitsets and how to interrogate them to know
what matched?

> some people don't like Filters because of how much space they take up for
> really large indexes, but if you've only got 100 docs ... there's no
> reason not to use them


Nope, I will never have any really large indexes here: 100 to 200 docs at the
most.

> -Hoss

Thanks for the reply, much appreciated.

Re: Need some Advice on Searching

Chris Hostetter-3

: Score/relevance is not important. I need the yes/no logic together with the
: info on what caused the match. Could you maybe explain how to intersect/union
: the bitsets and how to interrogate them to know
: what matched?

let's say hypothetically the logical "query" you want is "(A OR B) AND (C
OR D)"  where A, B, C, and D can all be represented as Lucene Query
objects.

that means you can say something like...

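   BitSet a = (new QueryFilter(A)).bits(reader);
   BitSet b = (new QueryFilter(B)).bits(reader);
   BitSet c = (new QueryFilter(C)).bits(reader);
   BitSet d = (new QueryFilter(D)).bits(reader);

   // BitSet.or()/and() modify in place and return void, so work on copies
   BitSet result = (BitSet) a.clone();
   result.or(b);                       // A OR B
   BitSet cOrD = (BitSet) c.clone();
   cOrD.or(d);                         // C OR D
   result.and(cOrD);                   // (A OR B) AND (C OR D)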

...now each set bit in result corresponds to a document that matches your
whole "query" (the bitIndex is the doc id). and for any given set bit, you
can look at a-d to find out which of the sub queries it matched on.

(QueryFilter isn't necessary, any filter that selects the documents you
want will work, QueryFilter is just handy for illustrating my point
easily).
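
For the "look at a-d" step, a minimal self-contained sketch using only java.util.BitSet might look like the following. The class `ClauseMatches`, its methods, and the hand-built bitsets in `main` are hypothetical stand-ins: in real use each BitSet would come from a Filter run against your IndexReader rather than being filled in by hand.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only; a hypothetical helper, not part of Lucene. */
final class ClauseMatches {

    /** Doc ids matching (A OR B) AND (C OR D). The input bitsets are not modified. */
    static BitSet combine(BitSet a, BitSet b, BitSet c, BitSet d) {
        BitSet aOrB = (BitSet) a.clone();
        aOrB.or(b);
        BitSet cOrD = (BitSet) c.clone();
        cOrD.or(d);
        aOrB.and(cOrD);
        return aOrB;
    }

    /** Names of the clauses whose bitset has the bit for docId set, in map order. */
    static List<String> matchedClauses(int docId, Map<String, BitSet> clauses) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, BitSet> clause : clauses.entrySet()) {
            if (clause.getValue().get(docId)) {
                names.add(clause.getKey());
            }
        }
        return names;
    }

    /** Convenience for building a BitSet from doc ids (stands in for a Filter's output). */
    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    public static void main(String[] args) {
        // Made-up per-clause results, purely for demonstration.
        Map<String, BitSet> clauses = new LinkedHashMap<>();
        clauses.put("A", bits(0, 2));
        clauses.put("B", bits(1));
        clauses.put("C", bits(2, 3));
        clauses.put("D", bits(1, 4));

        BitSet result = combine(clauses.get("A"), clauses.get("B"),
                                clauses.get("C"), clauses.get("D"));

        // Walk the set bits; each bit index is a matching doc id.
        for (int doc = result.nextSetBit(0); doc >= 0; doc = result.nextSetBit(doc + 1)) {
            System.out.println("doc " + doc + " matched " + matchedClauses(doc, clauses));
        }
    }
}
```

With the made-up bitsets above, docs 1 and 2 match the whole expression: doc 1 via B and D, doc 2 via A and C. Keeping the per-clause bitsets in a map keyed by a label is just one way to report "what caused the match".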


-Hoss

