indexing multiple email addresses in one field


Phil Whelan
Hi,

We have a very large Lucene index that we're developing that has a
field of email addresses. (Actually multiple fields with multiple
email addresses, but I'll simplify here.)

Each document will have one "email" field containing multiple email addresses.

I am indexing the email addresses using only WhitespaceAnalyzer, so as to
preserve the exact addresses and store multiple emails for one
document.

Example...
doc.add(new Field("email", "[hidden email] [hidden email] [hidden email]",
Field.Store.YES, Field.Index.ANALYZED ));

Terms for this document will then be...
email:[hidden email]
email:[hidden email]
email:[hidden email]

The problem I'm having is that these terms are rarely reused in other
documents. There is little overlap in email usage, and there are a
lot of very long email addresses. Because of this, the number of
terms in my index is very large, and I think it is causing performance
issues and bloating the index.

I think I'm not using Lucene optimally here.


A couple of questions...

1) Is there a way I can analyze these emails down to smaller terms but
still search for the exact email address? For instance, if I used a
different analyzer and broke these down to the terms "foo", "bar", and
"com", is Lucene able to find "email:[hidden email]" without matching
"email:[hidden email]"?

2) Does Lucene retain the positional information of tokens in the
index? Knowing this will help me answer question 1.

Thanks,
Phil

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: indexing multiple email addresses in one field

Matthew Hall-7
1. Sure, just have an analyzer that splits on all non-letter characters.
2. Phrase queries keep the order intact.  (And yes, the positional
information for the terms is kept, which is what allows span queries to
work)

So searching on the following "foo bar com" will match [hidden email] but
not [hidden email]

Matt
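
To make the phrase-matching idea above concrete, here is a minimal plain-Java sketch. It does not call Lucene; it only imitates an analyzer that splits on non-letter characters and an exact phrase match over the resulting token order. The addresses are made up for illustration (the real ones are hidden in this archive), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of the idea; it does not use Lucene itself.
class PhraseMatchSketch {

    // Split on every run of non-letter characters, keeping token order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("\\P{L}+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // True if the phrase tokens appear consecutively, in order, in the field.
    static boolean phraseMatches(List<String> field, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start + phrase.size() <= field.size(); start++) {
            if (field.subList(start, start + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical addresses, chosen to fit the "foo bar com" example.
        List<String> indexed = tokenize("foo@bar.com");
        System.out.println(phraseMatches(indexed, tokenize("foo@bar.com"))); // true
        System.out.println(phraseMatches(indexed, tokenize("bar@foo.com"))); // false
    }
}
```

The same three terms are shared by both addresses, so only the token order tells them apart, which is why positional information matters here.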



--
Matthew Hall
Software Engineer
Mouse Genome Informatics
[hidden email]
(207) 288-6012




Re: indexing multiple email addresses in one field

Phil Whelan
On Thu, Jul 30, 2009 at 11:22 AM, Matthew Hall
<[hidden email]> wrote:
>
> 1. Sure, just have an analyzer that splits on all non letter characters.
> 2. Phrase queries keep the order intact.  (And yes, the positional information for the terms is kept, which is what allows span queries to work)
>
> So searching on the following "foo bar com" will match [hidden email] but not [hidden email]

Thanks, I really appreciate your help with this. That's great to know.
Can I take this a little further...

If I have "[hidden email] [hidden email] [hidden email]" and analyze it I get
"foo bar com bar foo com com bar foo", so perhaps I need a different
way of delimiting the emails, as it will match some other combinations
here, eg. [hidden email] which is not one of the emails.

Has anyone done anything similar? I can imagine that one option would
be to filter the returned docs based on the original content of the
string I'm analyzing. Does Lucene do this for me?

Thanks,
Phil


Re: indexing multiple email addresses in one field

Matthew Hall-7
Place a delimiter between the email addresses that doesn't get removed
in your analyzer.  (preferably something you know will never be searched on)

That way you can ensure that each email matches independently of each other.

So something like

[hidden email] DELIM123 [hidden email] DELIM123 [hidden email]

Matt
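
A rough plain-Java sketch of the delimiter idea, again without Lucene. It assumes a letters-only tokenizer, so the marker used here is the hypothetical token DELIM (digits would be stripped by such a tokenizer). The three addresses are invented to reproduce the token sequence quoted earlier in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class DelimiterSketch {
    // Hypothetical marker token; it must survive analysis and never be searched for.
    static final String DELIM = "DELIM";

    // Split on every run of non-letter characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("\\P{L}+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Join the addresses with the marker between them, then tokenize.
    static List<String> indexWithDelimiter(List<String> addresses) {
        return tokenize(String.join(" " + DELIM + " ", addresses));
    }

    // Exact phrase match: the phrase must occur as a contiguous run of tokens.
    static boolean phraseMatches(List<String> field, List<String> phrase) {
        return !phrase.isEmpty() && Collections.indexOfSubList(field, phrase) >= 0;
    }

    public static void main(String[] args) {
        List<String> addresses = Arrays.asList("foo@bar.com", "bar@foo.com", "com@bar.foo");
        List<String> plain = tokenize(String.join(" ", addresses));
        List<String> delimited = indexWithDelimiter(addresses);
        List<String> bogus = tokenize("foo@com.com"); // not one of the addresses

        System.out.println(phraseMatches(plain, bogus));                        // true
        System.out.println(phraseMatches(delimited, bogus));                    // false
        System.out.println(phraseMatches(delimited, tokenize("bar@foo.com"))); // true
    }
}
```

Without the marker, the bogus address matches across the boundary between the second and third address; with it, only the real addresses match.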



Re: indexing multiple email addresses in one field

Paul Cowan-3
Matthew Hall wrote:
> Place a delimiter between the email addresses that doesn't get removed
> in your analyzer.  (preferably something you know will never be searched
> on)

Or add them separately. Rather than:
   doc.add(new Field("email", "[hidden email] [hidden email] [hidden email]" ...);
use:
   doc.add(new Field("email", "[hidden email]" ...);
   doc.add(new Field("email", "[hidden email]" ...);
   doc.add(new Field("email", "[hidden email]" ...);
with an Analyzer that overrides getPositionIncrementGap(). This
inserts a 'gap' between each set of Tokens for the same Field, which
stops phrase queries from 'crossing the boundaries' between subsequent
values.

Cheers,

Paul
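
The effect of a position gap on phrase matching can be sketched in plain Java. This is only a simulation under stated assumptions: it is not the Lucene API, the gap parameter stands in for whatever getPositionIncrementGap() would return, and the addresses and names are hypothetical.

```java
import java.util.SortedMap;
import java.util.TreeMap;

class PositionGapSketch {

    // Assign a position to every token of every value. Within a value,
    // positions increase by one; between values, `gap` extra positions are skipped.
    static SortedMap<Integer, String> index(int gap, String... values) {
        SortedMap<Integer, String> byPosition = new TreeMap<>();
        int next = 0;
        for (int v = 0; v < values.length; v++) {
            if (v > 0) {
                next += gap;
            }
            for (String token : values[v].split("\\P{L}+")) {
                if (!token.isEmpty()) {
                    byPosition.put(next++, token);
                }
            }
        }
        return byPosition;
    }

    // Exact phrase match: the phrase tokens must sit at consecutive positions.
    static boolean phraseMatches(SortedMap<Integer, String> byPosition, String phrase) {
        String[] wanted = phrase.split("\\P{L}+");
        for (int start : byPosition.keySet()) {
            boolean all = true;
            for (int i = 0; i < wanted.length; i++) {
                if (!wanted[i].equals(byPosition.get(start + i))) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] values = {"foo@bar.com", "bar@foo.com", "com@bar.foo"};
        System.out.println(phraseMatches(index(0, values), "foo@com.com")); // true
        System.out.println(phraseMatches(index(1, values), "foo@com.com")); // false
        System.out.println(phraseMatches(index(1, values), "bar@foo.com")); // true
    }
}
```

If sloppy phrase or span queries are in play, a gap of 1 may be too small to keep matches from straddling two values, so a larger gap is a common choice.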


ThreadedIndexWriter vs. IndexWriter

Jibo John
While trying out a few tuning options using contrib/benchmark as
described in the LIA (2nd edition) book, I made an interesting observation.

If I use a ThreadedIndexWriter (the example from lia2e, page 356)
instead of IndexWriter, the index size is reduced by 40%.
The index-related configuration in the alg file was the same for both
tests.

I am curious why using a threaded index writer would have an
impact on the index size.

Appreciate your input.

Thanks,
-Jibo


Re: indexing multiple email addresses in one field

Phil Whelan
In reply to this post by Paul Cowan-3
Hi Matthew / Paul,

On Thu, Jul 30, 2009 at 4:32 PM, Paul Cowan<[hidden email]> wrote:

> Matthew Hall wrote:
>>
>> Place a delimiter between the email addresses that doesn't get removed in
>> your analyzer.  (preferably something you know will never be searched on)
>
> Or add them separately (rather than:
>  doc.add(new Field("email", "[hidden email] [hidden email] [hidden email]" ...);
> use
>  doc.add(new Field("email", "[hidden email]");
>  doc.add(new Field("email", "[hidden email]");
>  doc.add(new Field("email", "[hidden email]");
> ), using an Analyzer that overrides getPositionIncrementGap(). This inserts
> a 'gap' between each set of Tokens for the same Field, which stops phrase
> queries from 'crossing the boundaries' between subsequent values.

I like the sound of that! I think I understand it.
getPositionIncrementGap() returns 0 by default, which keeps the "email"
field tokens sequential. Overriding it to return 1 will add an effective
blank token between the email addresses (overriding with 2 would leave
2). Similar to Matthew's delimiter token, but a bit neater.
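
One way to check this reading of the gap is to number the tokens by hand. The plain-Java sketch below does that for three made-up addresses; it is not Lucene code, and the gap argument simply plays the role of the value returned by getPositionIncrementGap().

```java
class GapPositionsSketch {

    // Render tokens as term(position), skipping `gap` extra positions
    // between consecutive values of the same field.
    static String annotate(int gap, String... values) {
        StringBuilder out = new StringBuilder();
        int position = 0;
        for (int v = 0; v < values.length; v++) {
            if (v > 0) {
                position += gap;
            }
            for (String token : values[v].split("\\P{L}+")) {
                if (token.isEmpty()) {
                    continue;
                }
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(token).append('(').append(position++).append(')');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical addresses; gap 0 gives strictly sequential positions.
        System.out.println(annotate(0, "foo@bar.com", "bar@foo.com", "com@bar.foo"));
        // Gap 1 leaves one unused position between the values.
        System.out.println(annotate(1, "foo@bar.com", "bar@foo.com", "com@bar.foo"));
    }
}
```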

So the tokens (with positions in brackets) would look something like this.

"foo(0) bar(1) com(2) bar(4) foo(5) com(6) com(8) bar(9) foo(10)"

Up until now I've only been using the WhitespaceAnalyzer, as I've been
keeping quite tight control over the fields going into the index
(not making best use of Lucene).

What Analyzer would you recommend I use for this? I'll also be
indexing IPs and other things, but that's pretty much the same story.
It seems I have to use the same Analyzer for all the fields in the
index?

I've been looking at StandardAnalyzer, but I do not want to remove
stop words. I mainly want to keep letters and numbers, and also
override getPositionIncrementGap. Is there anything that does these
things already, or close to it? Overriding getPositionIncrementGap
shouldn't be difficult though.

Cheers,
Phil


Re: indexing multiple email addresses in one field

Paul Cowan-3
Phil Whelan wrote:
> It seems I have to use the same Analyzer for the all the fields in the
> index?

Nope. Look at PerFieldAnalyzerWrapper, which is effectively a Map of
field names -> analyzers. This might help if different fields will have
very different values and semantics.

Cheers,

Paul
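
The "Map of field names -> analyzers" idea can be sketched in plain Java as below. This is not PerFieldAnalyzerWrapper itself; the class, the register/analyze methods, and the representation of an "analyzer" as a function from text to tokens are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class PerFieldSketch {
    // Used for any field that has no analyzer registered.
    private final Function<String, List<String>> fallback;
    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();

    PerFieldSketch(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    void register(String field, Function<String, List<String>> analyzer) {
        byField.put(field, analyzer);
    }

    // Dispatch on the field name, falling back to the default analyzer.
    List<String> analyze(String field, String text) {
        return byField.getOrDefault(field, fallback).apply(text);
    }

    // Helper: split on a regex and drop empty tokens.
    static List<String> splitOn(String regex, String text) {
        return Arrays.stream(text.split(regex))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Default: whitespace splitting; "email": letters and digits only.
        PerFieldSketch analyzers = new PerFieldSketch(text -> splitOn("\\s+", text));
        analyzers.register("email", text -> splitOn("[^\\p{L}\\p{N}]+", text));

        System.out.println(analyzers.analyze("email", "foo@bar.com"));    // [foo, bar, com]
        System.out.println(analyzers.analyze("ip", "10.0.0.1 10.0.0.2")); // [10.0.0.1, 10.0.0.2]
    }
}
```

Here the hypothetical "ip" field keeps whole addresses as terms while "email" is broken into small reusable terms, which is the kind of per-field difference the wrapper is meant for.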


Re: ThreadedIndexWriter vs. IndexWriter

Michael McCandless-2
In reply to this post by Jibo John
Hmm... this doesn't sound right.

That example (ThreadedIndexWriter) is meant to be a drop-in
replacement, wherever you use an IndexWriter, that keeps an
under-the-hood thread pool (using java.util.concurrent.*) to
add/update documents with multiple threads.

It should not result in a smaller index.

Can you sanity check the index?  Eg is numDocs() the same for both?
You definitely called close() on the writer, right?  That method waits
for all threads to finish their work before actually closing.

Mike
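
For readers without the book at hand, the general shape described above (a thread pool behind addDocument(), and a close() that waits for the workers) might look roughly like the sketch below. This is not the LIA example and not Lucene code; the class is hypothetical and a counter stands in for the index.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ThreadedWriterSketch {
    private final ExecutorService pool;
    // Stands in for the index: counts documents that were actually processed.
    private final AtomicInteger docCount = new AtomicInteger();

    ThreadedWriterSketch(int threads) {
        pool = Executors.newFixedThreadPool(threads);
    }

    // Queue the document; a pool thread does the (simulated) indexing work.
    void addDocument(String doc) {
        pool.execute(() -> docCount.incrementAndGet());
    }

    // Stop accepting work and wait for every queued document to be processed.
    void close() {
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing threads did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while closing", e);
        }
    }

    int numDocs() {
        return docCount.get();
    }

    public static void main(String[] args) {
        ThreadedWriterSketch writer = new ThreadedWriterSketch(4);
        for (int i = 0; i < 1000; i++) {
            writer.addDocument("doc " + i);
        }
        writer.close();
        System.out.println(writer.numDocs()); // 1000
    }
}
```

If close() were skipped, or returned before the pool drained, documents still sitting in the queue would be missing from the count, which is why the numDocs() check is a sensible first step.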

On Thu, Jul 30, 2009 at 8:01 PM, Jibo John<[hidden email]> wrote:

> While trying out a few tuning options using contrib/benchmak as described in
> LIA (2nd edition) book, I had an interesting observation.
>
> If I use a ThreadedIndexWriter (picked the example from lia2e, page 356)
> instead of IndexWriter, the index size got reduced by 40% compared to using
> IndexWriter.
> Index related configuration were the same for both the tests in the alg
> file.
>
> I am curious how come using a threaded index writer will have an impact on
> the index size.
>
> Appreciate your input.
>
> Thanks,
> -Jibo

Re: indexing multiple email addresses in one field

Matthew Hall-7
In reply to this post by Phil Whelan
And to address the stop word issue, you can override the stop word list
that it uses.

Most analyzers that use stop words (StandardAnalyzer included) have an
option to pass in an arbitrary list of stop words, which overrides the
defaults.

You could also just roll your own (which is what you are going to end up
doing here anyhow). When you do, just don't include stop word removal in
the processing of your token stream.

Matt
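
As a plain-Java illustration of why the stop word list matters for this use case, the sketch below runs the same tokenizer with a small sample stop list and with an empty one. It is not a Lucene analyzer; the stop list, the lowercasing step, and the example address are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordSketch {

    // Keep runs of letters and digits, lowercase them, and drop stop words.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A tiny sample stop list, not any library's default list.
        Set<String> sampleStops = new HashSet<>(Arrays.asList("the", "a", "and"));

        // With stop words, part of the address silently disappears.
        System.out.println(analyze("the@example.com", sampleStops)); // [example, com]
        // With an empty stop list, every part of the address is kept.
        System.out.println(analyze("the@example.com", Collections.<String>emptySet())); // [the, example, com]
    }
}
```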


Re: indexing multiple email addresses in one field

Phil Whelan
Thanks Matt. Thanks Paul. I'm up early (PST) and ready for a major
rewrite of my indexer. I think these changes are going to make a huge
difference.

Cheers,
Phil

On Fri, Jul 31, 2009 at 5:52 AM, Matthew Hall<[hidden email]> wrote:

> And to address the stop word issue, you can override the stop word list that
> it uses.
>
> Most analyzers that use stop words, (Standard included) has an option to
> pass it an arbitrary list of StopWords which will override the defaults.
>
> You could also just roll your own (which is what you are going to end up
> doing here anyhow)  When you do, just don't include stop word removal in the
> processing of your token stream.
>
> Matt



--
Mobile: +1  778-233-4935
Website: http://philw.co.uk
Skype: philwhelan76
Twitter: philwhln
Email : [hidden email]
iChat: [hidden email]


Re: ThreadedIndexWriter vs. IndexWriter

Jibo John
In reply to this post by Michael McCandless-2
The number of docs in the index is the same in both cases (200,000).
I haven't altered the benchmark/ code, but I used a profiler to verify
that the Benchmark main thread is closed only after all other threads
are closed.

Thanks,
-Jibo


On Jul 31, 2009, at 2:34 AM, Michael McCandless wrote:

> Hmm... this doesn't sound right.
>
> That example (ThreadedIndexWriter) is meant to be a drop-in
> replacement, wherever you use an IndexWriter, that keeps an
> under-the-hood thread pool (using java.util.concurrent.*) to
> add/update documents with multiple threads.
>
> It should not result in a smaller index.
>
> Can you sanity check the index?  Eg is numDocs() the same for both?
> You definitely called close() on the writer, right?  That method waits
> for all threads to finish their work before actually closing.
>
> Mike

Re: ThreadedIndexWriter vs. IndexWriter

Phil Whelan
Hi Jibo,

Have you tried optimizing the indexes? I do not know anything about the
implementation of ThreadedIndexWriter, but if they both optimize down
to the same size, it could just mean that ThreadedIndexWriter is not
as optimized.

Thanks,
Phil

On Fri, Jul 31, 2009 at 11:38 AM, Jibo John<[hidden email]> wrote:
> Number of docs are the same in the index for both the cases (200,000).
> I haven't altered the benchmark/ code, but, used a profiler to verify that
>  Benchmark main thread is closed only after all other  threads are closed.
>
> Thanks,
> -Jibo


Re: ThreadedIndexWriter vs. IndexWriter

ohaya
In reply to this post by Jibo John
Hi,

Sorry to jump in, but I've been following this thread with interest
:)...

Am I misunderstanding your original observation, that
ThreadedIndexWriter produced a smaller index?  Did the ThreadedIndexWriter
also finish faster (I'm assuming that it should)?

If the index is smaller, and everything else being good and equal,
doesn't that mean that using ThreadedIndexWriter is a good thing?

Anyway, aside from checking that the number of documents was the same, have
you looked at the index using something like Luke?  Do the contents of
the index look the same in both cases, or are they different?  If
different, how so (e.g., missing terms, etc.)?

Later,
Jim



Re: ThreadedIndexWriter vs. IndexWriter

Michael McCandless-2
In reply to this post by Jibo John
Hmmm... can you run CheckIndex on both indexes and post the results?

  java org.apache.lucene.index.CheckIndex /path/to/index

Mike


Re: ThreadedIndexWriter vs. IndexWriter

Jibo John
In reply to this post by ohaya
Tried with a larger set of documents (2,000,000 ) this time.

ThreadedIndexWriter
-------------------------------
Size  - 1.4 G
optimized - yes (as suggested by Phil)
Number of documents - 1,999,924 (no idea where the 76 documents vanished...)
Number of terms - 3,638,801


IndexWriter
---------------
Size - 1.8 G    (Noticed the size difference factor reduced to 23%)
optimized - yes
Number of documents  - 2,000,000  (All of them got in)
Number of terms - 10,624,806

I think it's getting complicated, with more unanswered questions:

1. Why didn't those 76 docs get in while using ThreadedIndexWriter?
2. Why would the number of terms triple for a difference of 76
documents out of 2 million?
3. And my original question: why is there still such a large size
difference between the two indexes?
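
The figures reported in this post can be cross-checked with a few lines of arithmetic. The sketch below only uses the numbers as posted (sizes rounded to one decimal, so the percentage is approximate); the class and method names are hypothetical.

```java
import java.util.Locale;

class IndexStatsSketch {

    // Fraction by which `smaller` is below `larger`.
    static double relativeReduction(double larger, double smaller) {
        return (larger - smaller) / larger;
    }

    public static void main(String[] args) {
        double sizeReduction = relativeReduction(1.8, 1.4);    // index sizes in G, as posted
        long missingDocs = 2_000_000L - 1_999_924L;             // documents not indexed
        double termRatio = 10_624_806.0 / 3_638_801.0;          // IndexWriter vs threaded terms

        System.out.println(String.format(Locale.ROOT, "size reduction: %.1f%%", sizeReduction * 100));
        System.out.println("missing docs: " + missingDocs);
        System.out.println(String.format(Locale.ROOT, "term ratio: %.2f", termRatio));
    }
}
```

From the rounded sizes the reduction works out to roughly 22%, close to the 23% quoted, while the plain IndexWriter index holds about 2.9 times as many terms, a difference far too large to come from 76 documents alone.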


Thanks,
-Jibo


On Jul 31, 2009, at 1:44 PM, [hidden email] wrote:

> Hi,
>
> Sorry to jump in, but I've been following this thread with  
> interest :)...
>
> Am I misunderstanding your original observation, that  
> ThreadedIndexWriter produced smaller index?  Did the  
> ThreadedIndexWriter also finish faster (I'm assuming that it should)?
>
> If the index is smaller, and everything else being good and equal,  
> doesn't that mean that using ThreadedIndexWriter is a good thing?
>
> Anyway, aside from checking that the # of documents were the same,  
> have you looked at the index using something like Luke?  Does the  
> contents of the index look the same in both cases, or were they  
> different?  If different, how so (e.g., missing terms, etc.)?
>
> Later,
> Jim
>
>
> On Fri, Jul 31, 2009 at 2:38 PM , Jibo John wrote:
>
>> Number of docs are the same in the index for both the cases  
>> (200,000).
>> I haven't altered the benchmark/ code, but, used a profiler to  
>> verify that  Benchmark main thread is closed only after all other  
>> threads are closed.
>>
>> Thanks,
>> -Jibo
>>
>>
>> On Jul 31, 2009, at 2:34 AM, Michael McCandless wrote:
>>
>>> Hmm... this doesn't sound right.
>>>
>>> That example (ThreadedIndexWriter) is meant to be a drop-in
>>> replacement, wherever you use an IndexWriter, that keeps an
>>> under-the-hood thread pool (using java.util.concurrent.*) to
>>> add/update documents with multiple threads.
>>>
>>> It should not result in a smaller index.
>>>
>>> Can you sanity check the index?  Eg is numDocs() the same for both?
>>> You definitely called close() on the writer, right?  That method  
>>> waits
>>> for all threads to finish their work before actually closing.
>>>
>>> Mike
>>>
>>> On Thu, Jul 30, 2009 at 8:01 PM, Jibo John<[hidden email]> wrote:
>>>> While trying out a few tuning options using contrib/benchmark as described in
>>>> the LIA (2nd edition) book, I had an interesting observation.
>>>>
>>>> If I use a ThreadedIndexWriter (picked the example from lia2e, page 356)
>>>> instead of IndexWriter, the index size is reduced by 40% compared to using
>>>> IndexWriter.
>>>> The index-related configuration was the same for both tests in the alg
>>>> file.
>>>>
>>>> I am curious how using a threaded index writer can have an impact on
>>>> the index size.
>>>>
>>>> Appreciate your input.
>>>>
>>>> Thanks,
>>>> -Jibo
>>>>

Re: ThreadedIndexWriter vs. IndexWriter

Jibo John
In reply to this post by Michael McCandless-2
Mike,

Here you go:


IndexWriter:
----------------
$ java -classpath /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index

NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled

Opening index @ /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index

Segments file=segments_a numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
  1 of 1: name=_18 docCount=200000
    compound=true
    hasProx=true
    numFiles=1
    size (MB)=427.448
    diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M - 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true, mergeDocStores=true, java.vendor=Apple Inc., os.version=10.5.7, source=merge, mergeFactor=4}
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [4 fields]
    test: terms, freq, prox...OK [3512343 terms; 80020204 terms/docs pairs; 163219760 tokens]
    test: stored fields.......OK [200000 total field count; avg 1 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

No problems were detected with this index.


ThreadedIndexWriter:
-----------------------------

$ java -classpath /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index

NOTE: testing will be more thorough if you run java with '-ea:org.apache.lucene...', so assertions are enabled

Opening index @ /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/work/index

Segments file=segments_3 numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
  1 of 1: name=_q docCount=199970
    compound=true
    hasProx=true
    numFiles=3
    size (MB)=319.107
    diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev 779767M - 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true, mergeDocStores=false, java.vendor=Apple Inc., os.version=10.5.7, source=merge, mergeFactor=6}
    docStoreOffset=0
    docStoreSegment=_0
    docStoreIsCompoundFile=false
    no deletions
    test: open reader.........OK
    test: fields, norms.......OK [4 fields]
    test: terms, freq, prox...OK [1227086 terms; 69244121 terms/docs pairs; 134390948 tokens]
    test: stored fields.......OK [199970 total field count; avg 1 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

No problems were detected with this index.


$
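
For a quick side-by-side view, the figures in the two CheckIndex reports above can be compared with a few lines of Java. This is only a sketch: the class and method names (CheckIndexDiff, percentChange) are made up for illustration, and the numbers are copied by hand from the reports rather than read from the index.

```java
class CheckIndexDiff {
    // Percentage change going from the baseline value to the other value.
    static double percentChange(double baseline, double other) {
        return (other - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Figures copied from the two CheckIndex reports in this message.
        long docsPlain = 200000L, docsThreaded = 199970L;
        long termsPlain = 3512343L, termsThreaded = 1227086L;
        long tokensPlain = 163219760L, tokensThreaded = 134390948L;
        double sizePlainMb = 427.448, sizeThreadedMb = 319.107;

        System.out.printf("docs missing:  %d%n", docsPlain - docsThreaded);
        System.out.printf("size change:   %.1f%%%n", percentChange(sizePlainMb, sizeThreadedMb));
        System.out.printf("terms change:  %.1f%%%n", percentChange(termsPlain, termsThreaded));
        System.out.printf("tokens change: %.1f%%%n", percentChange(tokensPlain, tokensThreaded));
    }
}
```

Read this way, the reports show 30 fewer documents in the ThreadedIndexWriter index, while the unique term count drops far more sharply than the token count does.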



On Jul 31, 2009, at 2:52 PM, Michael McCandless wrote:

> Hmmm... can you run CheckIndex on both indexes and post the results?
>
>  java org.apache.lucene.index.CheckIndex /path/to/index
>
> Mike
>



Re: ThreadedIndexWriter vs. IndexWriter

Phil Whelan
Hi Jibo,

Your mergeFactor is different, and the resulting numFiles (segment
files) is different. Maybe each thread is responsible for a segment
file. Just curious - do you have 3 threads?

Phil


Re: ThreadedIndexWriter vs. IndexWriter

ohaya
In reply to this post by Jibo John
Hi,

I don't know the answers to your questions, but I'm guessing that the answer to #3 probably follows from the answers to #1 and #2.

Did you try to look at the indexes using Luke?  That shows the top 50 terms when it starts, so it might be obvious what the differences are, and that might give someone here (more knowledgeable than myself) a hint as to what's going on.

Jim
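
One way to make "what the differences are" concrete is to dump the terms of each index (with their document frequencies) and diff the two sets. The sketch below assumes the terms have already been exported into plain maps; the class TermDiff, its method, and the sample terms are all hypothetical, and no Lucene API is used.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class TermDiff {
    // Returns the terms that occur in 'reference' but are absent from 'candidate'.
    static SortedSet<String> missingTerms(Map<String, Integer> reference,
                                          Map<String, Integer> candidate) {
        SortedSet<String> missing = new TreeSet<String>(reference.keySet());
        missing.removeAll(candidate.keySet());
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical "field:term -> docFreq" dumps, one per index.
        Map<String, Integer> plain = new TreeMap<String, Integer>();
        plain.put("body:lucene", 120);
        plain.put("body:index", 300);
        plain.put("docid:doc17", 1);

        Map<String, Integer> threaded = new TreeMap<String, Integer>();
        threaded.put("body:lucene", 120);
        threaded.put("body:index", 299);

        System.out.println("Only in the IndexWriter index: " + missingTerms(plain, threaded));
    }
}
```

If the missing terms turn out to cluster in a single field, that would at least narrow down which field accounts for the gap in term counts.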



---- Jibo John <[hidden email]> wrote:

> Tried with a larger set of documents (2,000,000) this time.
>
> ThreadedIndexWriter
> -------------------------------
> Size - 1.4 G
> optimized - yes (as suggested by Phil)
> Number of documents - 1,999,924 (No idea where the 76 documents
> vanished...)
> Number of terms - 3,638,801
>
>
> IndexWriter
> ---------------
> Size - 1.8 G    (Noticed the size difference factor reduced to 23%)
> optimized - yes
> Number of documents - 2,000,000  (All of them got in)
> Number of terms - 10,624,806
>
> I think it's getting complicated, with more unanswered questions.
>
> 1. Why didn't those 76 docs get in while using ThreadedIndexWriter?
> 2. Why would the number of terms triple for a difference of 76
> documents out of 2 million?
> 3. And my original question: why is there still a huge difference in
> size between the two indexes?
>
>
> Thanks,
> -Jibo



Re: ThreadedIndexWriter vs. IndexWriter

Jibo John
In reply to this post by Phil Whelan
Hi Phil,

It's 5 threads for IndexWriter.

For ThreadedIndexWriter, I used:

writer.num.threads=16
writer.max.thread.queue.size=80

Thanks,
-Jibo
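
For readers without the book at hand, here is a rough sketch of how those two settings could map onto a java.util.concurrent pool. It is not the lia2e code: DocSink is a hypothetical stand-in for the real writer, and the CallerRunsPolicy and the 60-second wait are assumptions made for this example. The point is only that close() has to wait for the queued work, since any document still sitting in the queue when the writer goes away never gets added.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PooledWriterSketch {
    // Hypothetical stand-in for the real writer; not a Lucene type.
    interface DocSink {
        void add(String doc);
    }

    private final DocSink sink;
    private final ThreadPoolExecutor pool;

    PooledWriterSketch(DocSink sink, int numThreads, int maxQueueSize) {
        this.sink = sink;
        // Bounded queue; when it is full, the submitting thread runs the task
        // itself instead of dropping it (assumed rejection policy).
        this.pool = new ThreadPoolExecutor(numThreads, numThreads, 0L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<Runnable>(maxQueueSize),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    void addDocument(final String doc) {
        pool.execute(new Runnable() {
            public void run() {
                sink.add(doc);
            }
        });
    }

    // Stops accepting new work, then blocks until queued documents are processed.
    // Returns false if the wait timed out before the pool drained.
    boolean close() throws InterruptedException {
        pool.shutdown();
        return pool.awaitTermination(60, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        final AtomicInteger added = new AtomicInteger();
        PooledWriterSketch writer = new PooledWriterSketch(new DocSink() {
            public void add(String doc) {
                added.incrementAndGet();
            }
        }, 16, 80); // writer.num.threads=16, writer.max.thread.queue.size=80
        for (int i = 0; i < 2000; i++) {
            writer.addDocument("doc" + i);
        }
        boolean finished = writer.close();
        System.out.println("finished=" + finished + " added=" + added.get());
    }
}
```

With a wrapper like this, comparing the count of documents submitted against the count that actually reached the sink after close() is a cheap check for documents lost in the pool.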


