[jira] Created: (LUCENE-2252) stored field retrieve slow

[jira] Created: (LUCENE-2252) stored field retrieve slow

ASF GitHub Bot (Jira)
stored field retrieve slow
--------------------------

                 Key: LUCENE-2252
                 URL: https://issues.apache.org/jira/browse/LUCENE-2252
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Store
    Affects Versions: 3.0
            Reporter: John Wang


IndexReader.document() on a stored field is rather slow. I ran a simple multi-threaded test and profiled it:

40+% of the time is spent getting the offset from the index file
30+% of the time is spent reading the count (i.e. the number of fields to load)

I ran it on my laptop, where the disk isn't that great, but there still seems to be a lot of room for improvement, e.g. loading the field index file into memory (for a 5M doc index, the extra memory footprint is 20MB, peanuts compared to the other stuff being loaded).
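
For reference, the 20MB figure corresponds to 5M documents at 4 bytes per entry; at 8 bytes per entry the same index would need 40MB. A minimal sketch of that arithmetic follows. The class and method names are hypothetical and not part of Lucene.

```java
// Hypothetical helper, not part of Lucene: estimates the heap needed to hold
// one fixed-width offset entry per document.
final class FdxFootprint {
    // Bytes needed for docCount entries of bytesPerEntry each.
    // multiplyExact throws ArithmeticException instead of silently overflowing.
    static long bytesNeeded(long docCount, int bytesPerEntry) {
        return Math.multiplyExact(docCount, (long) bytesPerEntry);
    }

    public static void main(String[] args) {
        long docs = 5_000_000L;
        System.out.println(bytesNeeded(docs, 4)); // prints 20000000 (about 20MB)
        System.out.println(bytesNeeded(docs, 8)); // prints 40000000 (about 40MB)
    }
}
```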

On a related note, are there plans to support custom segments as part of the flexible indexing feature?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Commented: (LUCENE-2252) stored field retrieve slow

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830564#action_12830564 ]

Robert Muir commented on LUCENE-2252:
-------------------------------------

John, couldn't you simply write your own Directory if you want to put the fdx in RAM? I am not sure about 'peanuts'; some people may not want to pay 8 bytes/doc, or whatever it is, for this stored field offset when the memory could be put to better use elsewhere.



[jira] Commented: (LUCENE-2252) stored field retrieve slow

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830582#action_12830582 ]

Uwe Schindler commented on LUCENE-2252:
---------------------------------------

FileSwitchDirectory comes to mind. Just delegate the *.fdx extension to a RAMDirectory. When instantiating the dir, create the in-memory copy of those files as you wrap it with FileSwitchDirectory.
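
The delegation idea can be sketched without Lucene on the classpath. The block below is a stdlib-only analogue, not Lucene's Directory API: reads are routed by file extension to one of two sources, with a map standing in for the in-memory copy and another map standing in for the on-disk directory. All class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Stdlib-only analogue of routing files by extension; it does NOT use
// Lucene's Directory API. All names here are hypothetical.
final class ExtensionSwitch {
    private final Set<String> primaryExtensions;
    private final Function<String, byte[]> primary;   // e.g. in-memory copies
    private final Function<String, byte[]> secondary; // e.g. on-disk files

    ExtensionSwitch(Set<String> primaryExtensions,
                    Function<String, byte[]> primary,
                    Function<String, byte[]> secondary) {
        this.primaryExtensions = primaryExtensions;
        this.primary = primary;
        this.secondary = secondary;
    }

    // Text after the last '.', or "" when the name has no extension.
    static String extensionOf(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1);
    }

    // Route the read to whichever source owns this file's extension.
    byte[] read(String name) {
        boolean usePrimary = primaryExtensions.contains(extensionOf(name));
        return (usePrimary ? primary : secondary).apply(name);
    }

    public static void main(String[] args) {
        Map<String, byte[]> disk = new HashMap<>(); // stands in for the real directory
        disk.put("_0.fdx", new byte[] {1, 2, 3});
        disk.put("_0.fdt", new byte[] {4, 5});

        // Copy the matching files into memory once, up front.
        Map<String, byte[]> ram = new HashMap<>();
        for (Map.Entry<String, byte[]> e : disk.entrySet()) {
            if (extensionOf(e.getKey()).equals("fdx")) {
                ram.put(e.getKey(), e.getValue().clone());
            }
        }

        ExtensionSwitch files =
            new ExtensionSwitch(Collections.singleton("fdx"), ram::get, disk::get);
        System.out.println(files.read("_0.fdx").length); // served from the RAM copy
        System.out.println(files.read("_0.fdt").length); // served from "disk"
    }
}
```

With real Lucene classes the same shape would presumably be a FileSwitchDirectory wrapping a RAMDirectory (holding the copied fdx files) and the on-disk directory; check the javadoc of the release in use for the exact constructor.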


[jira] Commented: (LUCENE-2252) stored field retrieve slow

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830599#action_12830599 ]

John Wang commented on LUCENE-2252:
-----------------------------------

Thanks Uwe for the pointer. Will check that out!

Robert, we can get away with 4 bytes per doc assuming we are not storing 2GB of data per doc. That is less memory than the data structure that has to be held in memory for a single field cache entry used for sorting. I understand it is always better to use less memory, but sometimes we do have to make trade-off decisions.
But you are right, different applications have different needs/requirements, so having support for custom segments would be a good thing, e.g. LUCENE-1914.


[jira] Commented: (LUCENE-2252) stored field retrieve slow

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830603#action_12830603 ]

Yonik Seeley commented on LUCENE-2252:
--------------------------------------

The thing about stored fields is that it's normally not inner-loop stuff.  The index may be 100M documents, but the average application pages through hits a handful at a time.  And when loading stored fields gets really slow, it tends to be the OS cache misses due to the index being large.  We should still optimize it if we can of course (some apps do access many fields at once), but I agree with Robert that a direct in-memory stored field index probably wouldn't be a good default.

John, do you have a specific use case where this is the bottleneck, or are you just looking for places to optimize in general?


[jira] Commented: (LUCENE-2252) stored field retrieve slow

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830615#action_12830615 ]

Robert Muir commented on LUCENE-2252:
-------------------------------------

bq. Robert, we can get away with 4 bytes per doc assuming we are not storing 2GB of data per doc

I do not understand, I think the fdx index is the raw offset into fdt for some doc, and must remain a long if you have more than 2GB total across all docs.



[jira] Commented: (LUCENE-2252) stored field retrieve slow

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830627#action_12830627 ]

John Wang commented on LUCENE-2252:
-----------------------------------

bq. I do not understand, I think the fdx index is the raw offset into fdt for some doc, and must remain a long if you have more than 2GB total across all docs.

As stated earlier, assuming we are not storing 2GB of data per doc, you don't need to keep a long per doc. There are many ways of representing this without paying much of a performance penalty. Off the top of my head, this would work:

Since positions are never negative, you can use the first bit to indicate whether MAX_INT has been reached; if so, add MAX_INT to the masked bits. You get away with an int per doc.
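
One literal reading of that scheme is sketched below: the top bit of the int flags "at or past 2^31" and the low 31 bits hold the remainder. This is an interpretation, not code from the thread or from Lucene, and the names are hypothetical. Two assumptions are worth stating: the sketch adds 2^31 rather than MAX_INT (2^31 - 1) so that every value round-trips uniquely, and it only covers offsets below 4GiB, which is the same as treating the int as unsigned. The comment does not say how larger files would be handled.

```java
// Hypothetical helper, not Lucene code: packs a non-negative file offset
// below 2^32 into a single int, using the sign bit as a "high half" flag.
final class IntOffsetCodec {
    static final long HIGH = 1L << 31;  // 2^31, used here in place of MAX_INT
    static final long LIMIT = 1L << 32; // offsets must be below 4GiB

    static int encode(long offset) {
        if (offset < 0 || offset >= LIMIT) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        int low = (int) (offset & 0x7FFFFFFFL);           // low 31 bits
        return offset >= HIGH ? (low | 0x80000000) : low; // set flag if needed
    }

    static long decode(int stored) {
        long masked = stored & 0x7FFFFFFFL;               // strip the flag bit
        return stored < 0 ? masked + HIGH : masked;       // flag set => add 2^31
    }

    public static void main(String[] args) {
        long offset = 3_000_000_000L;                     // does not fit in a signed int
        int stored = encode(offset);
        System.out.println(decode(stored));               // prints 3000000000
        // Same result as the JDK's unsigned interpretation of the int.
        System.out.println(Integer.toUnsignedLong(stored) == decode(stored)); // prints true
    }
}
```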

I am sure there are tons of other neat tricks for this that the Mikes or Yonik can come up with :)

bq. John, do you have a specific use case where this is the bottleneck, or are you just looking for places to optimize in general?

Hi Yonik, I understand this may not be a common use case. I am trying to use Lucene as a storage solution, e.g. supporting just get()/put() operations as a content store. We wrote something simple in house and compared it against Lucene, and the difference was dramatic. After profiling, it seems this is an area with lots of room for improvement (see the numbers posted earlier).

Reasons:
1) Our current setup is that the content is stored outside of the search cluster. It seems that being able to fetch the data for rendering/highlighting within our search cluster would be good.
2) If the index contains the original data, changing the indexing schema (i.e. reindexing) can be done within each partition/node. Getting data from our authoritative datastore is expensive.

Perhaps LUCENE-1912 is the right way to go rather than "fixing" stored fields. If you also agree, I can just dup it over.

Thanks

-John



[jira] Commented: (LUCENE-2252) stored field retrieve slow

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830628#action_12830628 ]

John Wang commented on LUCENE-2252:
-----------------------------------

Sorry, I meant LUCENE-1914


[jira] Commented: (LUCENE-2252) stored field retrieve slow

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830632#action_12830632 ]

Robert Muir commented on LUCENE-2252:
-------------------------------------

bq. as stated earlier, assuming we are not storing 2GB of data per doc, you don't need to keep a long per doc.

Right, you stated this, but even if your 'store long into an int' works, I still think 4 bytes/doc is too much (it's too much wasted ram for virtually no gain).

I don't understand why you need something like a custom segment file to do this; why can't you simply use Directory to load this particular file into memory for your use case?



[jira] Commented: (LUCENE-2252) stored field retrieve slow

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830641#action_12830641 ]

John Wang commented on LUCENE-2252:
-----------------------------------

bq. I still think 4 bytes/doc is too much (it's too much wasted ram for virtually no gain)

That depends on the application. In modern machines (at least with the machines we are using, e.g. a macbook pro) we can afford it :) I am not sure I agree with "virtually no gain" if you look at the numbers I posted. IMHO, the gain is significant.

I hate to get into a subjective argument on this though.

bq. I don't understand why you need something like a custom segment file to do this; why can't you simply use Directory to load this particular file into memory for your use case?

Having a custom segment lets me avoid this subjective argument about what is too much memory or what the gain is, since it just depends on my application, right?

Furthermore, with the question at hand, even if we do use Directory implementation Uwe suggested, it is not optimal. For my use case, the cost of the seek/read for the count on the data file is very wasteful. Also, even for getting the position, I can just do a random access into an array, compared to an in-memory seek plus read/parse.
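
The two lookups being compared can be sketched side by side. The layout used here (a fixed-size header followed by one big-endian 8-byte offset per document) is an assumption made for illustration, and the class and method names are hypothetical rather than Lucene's.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not Lucene code. Assumes the index is a fixed-size
// header followed by one big-endian 8-byte offset per document.
final class OffsetLookup {
    // "In-memory seek, read/parse": position into the raw bytes and decode
    // one entry on every call.
    static long parseFromBytes(ByteBuffer index, int headerBytes, int docId) {
        return index.getLong(headerBytes + docId * 8);
    }

    // Decode every entry once so later lookups are plain array reads.
    static long[] decodeAll(ByteBuffer index, int headerBytes) {
        int count = (index.limit() - headerBytes) / 8;
        long[] offsets = new long[count];
        for (int doc = 0; doc < count; doc++) {
            offsets[doc] = index.getLong(headerBytes + doc * 8);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Build a tiny index: 4-byte header, then offsets for three documents.
        ByteBuffer index = ByteBuffer.allocate(4 + 3 * 8);
        index.putInt(0, 2);       // made-up header value
        index.putLong(4, 0L);
        index.putLong(12, 100L);
        index.putLong(20, 250L);

        long[] offsets = decodeAll(index, 4);
        System.out.println(offsets[2]);                   // prints 250
        System.out.println(parseFromBytes(index, 4, 1));  // prints 100
    }
}
```

The array costs a long per document on the heap, which is the trade-off under discussion in this thread.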

The very simple store mechanism we have written outside of lucene has a gain of >85x, yes, 8500%, over lucene stored fields. However, we would like to take advantage of some of the good stuff already in Lucene, e.g. the merge mechanism (which is very nicely done), delete handling, etc.



[jira] Commented: (LUCENE-2252) stored field retrieve slow

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830642#action_12830642 ]

Robert Muir commented on LUCENE-2252:
-------------------------------------

bq. In modern machines (at least with the machines we are using, e.g. a macbook pro)

It's not really subjective, or based on modern machines. You are talking about 5M documents; some indexes have a lot more documents, and 4 bytes/doc in RAM adds up to a lot!
For the case of using Lucene as a search engine library, this memory could be better spent on other things.
I don't think this is subjective, because it's a search engine library, not a document store.

bq. Furthermore, with the question at hand, even if we do use Directory implementation Uwe suggested, it is not optimal.

But it is easy, and it takes away your disk seek. The "in-memory seek, read/parse" is, as you say, peanuts in comparison.



[jira] Commented: (LUCENE-2252) stored field retrieve slow

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830689#action_12830689 ]

Michael McCandless commented on LUCENE-2252:
--------------------------------------------

bq. The very simple store mechanism we have written outside of lucene has a gain of >85x, yes, 8500%, over lucene stored fields.

John, can you describe the approach here?


[jira] Commented: (LUCENE-2252) stored field retrieve slow

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848972#action_12848972 ]

John Wang commented on LUCENE-2252:
-----------------------------------

Hi Mike:

Sorry for the late reply. We have written something for this purpose:

http://snaprojects.jira.com/wiki/display/KRTI/Krati+Performance+Evaluation

Thanks

-John
