Indexing the same field multiple times in a doc.


Indexing the same field multiple times in a doc.

Erick Erickson
That's waaaaay cool. I'm not entirely clear on this point, but my test sure
worked out well. So I thought I'd see if the behavior I'm seeing is
expected. If so, way cool coding guys!

Here's the root question: "Am I reasonably safe, for a single document, in
thinking of indexing multiple chunks with the same field as being identical,
for all practical purposes, with indexing the field once with all the chunks
concatenated together?".

Details follow...

Background: I'm working with books now, so the 10,000 word limit isn't
acceptable. It is, of course, easy to bump it up via
IndexWriter.setMaxFieldLength, but I'd rather just break up the input stream
since I don't have a realistic expectation of putting an upper limit on the
size of a doc. So I experimented a bit with indexing the same field multiple
times in a document. It seems to me that everything "just works", at least
in my simple test case. This is one heck of a coincidence if not intended
behavior <G>, so all I'm asking is if anybody is surprised by this
(especially Otis, Chris, Erik, etc).

Please ignore the Field.Store value; I realize it will probably have to be NO
for this app.

Lucene is fine with indexing the same field multiple times, like this....

    Document doc = new Document();
    doc.add(new Field("tokens", "one two three", Field.Store.YES,
                      Field.Index.TOKENIZED));
    doc.add(new Field("tokens", "four five six", Field.Store.YES,
                      Field.Index.TOKENIZED));
    writer.addDocument(doc);

What surprised me a bit is that SpanQueries work just fine this way. If I
create a span query for "two" and "five", this doc is found for some slop
factors and not found for other slop factors, just as though I indexed the
"tokens" field once with "one two three four five six".

Boolean MUST clauses work as well.

So, are there any "gotchas" that spring to mind with the notion of chunking
the input to < 10,000 words and indexing the chunks multiple times in the
same field? Let me be clear that I'm just beginning to design this, so all I'm
thinking about here is whether indexing the same field multiple times is
fraught with danger <G>. I have a bunch of details to work out about what
the requirements are and whether I really *want* to do this, but that's my
problem.

Thanks
Erick
Re: Indexing the same field multiple times in a doc.

Erick Erickson
I think I just answered my own question: no, there's no difference,
including running into an OutOfMemoryError when adding the doc. It seems
like I can think of this exactly as concatenating all the chunks together
and indexing them all at once, including the limitations.

So even though it all seems to work for small examples, I don't think it
helps.

Never mind......

Erick

On 8/15/06, Erick Erickson <[hidden email]> wrote:

> Here's the root question: "Am I reasonably safe, for a single document, in
> thinking of indexing multiple chunks with the same field as being identical,
> for all practical purposes, with indexing the field once with all the chunks
> concatenated together?".
Re: Indexing the same field multiple times in a doc.

Chris Hostetter
In reply to this post by Erick Erickson
:
: Here's the root question: "Am I reasonably safe, for a single document, in
: thinking of indexing multiple chunks with the same field as being identical,
: for all practical purposes, with indexing the field once with all the chunks
: concatenated together?".

Essentially that's true -- the difference is in the way the
getPositionIncrementGap method of your analyzer is used. For a single large
field value it is never used; for two or more values, it's called in
between each value.

: What surprised me a bit is that SpanQueries work just fine this way. If I
: create a span query for "two" and "five", this doc is found for some slop
: factors and not found for other slop factors, just as though I indexed the
: "tokens" field once with "one two three four five six".

Right ... that's where getPositionIncrementGap can come in handy: you can
introduce a "large" gap size so that phrase/span queries with a big slop can
match across multiple values while those with a small slop can't.
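
A minimal sketch (2.x-era API, illustrative only -- the gap size and index
path are arbitrary placeholders) of overriding that method:

    // Give multi-valued fields a large position gap between successive
    // values so low-slop phrase/span queries can't match across values.
    Analyzer analyzer = new WhitespaceAnalyzer() {
        public int getPositionIncrementGap(String fieldName) {
            return 1000; // arbitrary "large" gap
        }
    };
    IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);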

: So, are there any "gotchas" that spring to mind with the notion of chunking
: the input to < 10,000 words and indexing the chunks multiple times in the
: same field? Let me be clear I'm just beginning to design this, so all I'm

Well, there's really nothing wrong with using multiple values; as far as
doing that to deal with the 10,000-term limit, though...

 a) if you don't want the limit, change the limit -- there's no reason to
work around it.
 b) I'm not entirely sure if that limit is on a single field value, or on
the total number of indexed tokens for that field name -- in which case
this approach doesn't work around it at all -- make sure you test with
two field values whose total number of tokens is bigger than 10,000.




-Hoss


Re: Indexing the same field multiple times in a doc.

Yonik Seeley
On 8/15/06, Chris Hostetter <[hidden email]> wrote:
>  b) I'm not entirely sure if that limit is on a single field value, or on
> the total number of indexed tokens for that field name -- in which case
> this approach doesn't work around it at all -- make sure you test with
> two field values whose total number of tokens is bigger than 10,000.

maxFieldLength is the number of tokens indexed *per field name*, not per
field value.  Breaking up a big field into multiple small fields with
the same field name does not change the number of tokens indexed.  As
Hoss says, simply change maxFieldLength to whatever you want.
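
For example, a minimal sketch (1.9/2.x-era API; the index path and analyzer
here are placeholders) of raising the limit at indexing time:

    // The default maxFieldLength is 10,000 tokens per field name;
    // Integer.MAX_VALUE effectively removes the limit.
    IndexWriter writer = new IndexWriter("/path/to/index",
                                         new StandardAnalyzer(), true);
    writer.setMaxFieldLength(Integer.MAX_VALUE);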

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
