Java heap space

Re: Java heap space

Bill Au
FYI, I have just committed the a


Re: Java heap space

Bill Au
Sorry, hit the wrong key before...

FYI, I have just committed all the changes related to the Jetty downgrade
into SVN.
Let me know if you notice any problems.

Bill

On 5/9/06, Bill Au <[hidden email]> wrote:

>
> FYI, I have just committed the a
>
>
> On 5/8/06, Bill Au <[hidden email]> wrote:
> >
> > I was able to produce an OutOfMemoryError using Yonik's python script
> > with Jetty 6.
> > I was not able to do so with Jetty 5.1.11RC0, the latest stable
> > version.  So that's the version of Jetty to which I will downgrade the
> > Solr example app.
> >
> > Bill
> >
> >
> > On 5/5/06, Erik Hatcher <[hidden email]> wrote:
> > >
> > > Along these lines, locally I've been using the latest stable version
> > > of Jetty and it has worked fine, but I did see an "out of memory"
> > > exception the other day but have not seen it since so I'm not sure
> > > what caused it.
> > >
> > > Moving to Tomcat, as long as we can configure it to be as lightweight
> > > as possible, is quite fine to me as well.
> > >
> > >         Erik
> > >
> > >
> > > On May 5, 2006, at 12:12 PM, Bill Au wrote:
> > >
> > > > There seems to be a fair number of folks using Jetty with the
> > > > example app as opposed to using Solr with their own appserver.
> > > > So I think it is best to use a stable version of Jetty instead of
> > > > the beta.  If no one objects, I can go ahead and take care of this.
> > > >
> > > > Bill
> > > >
> > > > On 5/4/06, Yonik Seeley <[hidden email]> wrote:
> > > >>
> > > >> I verified that Tomcat 5.5.17 doesn't experience this problem.
> > > >>
> > > >> -Yonik
> > > >>
> > > >> On 5/4/06, Yonik Seeley <[hidden email]> wrote:
> > > >> > On 5/3/06, Yonik Seeley <[hidden email]> wrote:
> > > >> > > I just tried sending in 100,000 deletes and it didn't cause a
> > > >> > > problem: the memory grew from 22M to 30M.
> > > >> > >
> > > >> > > Random thought: perhaps it has something to do with how you are
> > > >> > > sending your requests?
> > > >> >
> > > >> > Yep, I was able to reproduce a memory problem w/ Jetty on Linux
> > > >> > when using non-persistent connections (closed after each
> > > >> > request).  The same 100,000 deletes blew up the JVM to 1GB heap.
> > > >> >
> > > >> > So this looks like it could be a Jetty problem (shame on me for
> > > >> > using a beta).
> > > >> > I'm still not quite sure what changed in Solr that could make it
> > > >> > appear in later versions and not in earlier versions though... the
> > > >> > version of Jetty is the same.
> > > >>
> > >
> > >
> >
>

One big XML file vs. many HTTP requests

Michael Levy-3
In reply to this post by Yonik Seeley
Greetings,

I'm evaluating using Solr under Tomcat to replace a number of text
searching projects that currently use UMASS's INQUERY, an older search
engine.

One nice feature of INQUERY is that you can create one large SGML file,
containing lots of records, each bracketed with <DOC> and </DOC> tags.  
Submitting that big SGML document for indexing goes very fast.

I believe that Solr indexes one document at a time; each document
requires a separate HTTP POST.

How efficient is making a separate HTTP request per document when there
are millions of documents?  Do people ever use Solr's or Lucene's API
directly for indexing large numbers of documents, and if so, what are
the considerations pro and con?

Thanks to Yonik, Chris, and everyone for all your work; Solr looks really
great.


Re: One big XML file vs. many HTTP requests

Yonik Seeley
On 5/12/06, Michael Levy <[hidden email]> wrote:
> How efficient is making a separate HTTP request per document when there
> are millions of documents?

If you use persistent connections and make multiple requests in
parallel, there won't be much difference compared to sending multiple
docs per request.

-Yonik
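
To make this concrete, here is a minimal sketch of the approach Yonik
describes: single-document adds sent in parallel over reused (persistent)
connections, using the JDK 11+ java.net.http.HttpClient. The update URL,
field names, and values are assumptions for illustration, not taken from
the thread.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class ParallelPoster {
    // Assumed local Solr update URL; adjust host, port, and core path to your install.
    private static final URI UPDATE = URI.create("http://localhost:8983/solr/update");

    public static void main(String[] args) {
        // A single HttpClient reuses persistent (keep-alive) connections across requests.
        HttpClient client = HttpClient.newHttpClient();

        // Illustrative single-document <add> payloads; the field names are made up.
        List<String> payloads = List.of(
                "<add><doc><field name=\"id\">1</field><field name=\"name\">first</field></doc></add>",
                "<add><doc><field name=\"id\">2</field><field name=\"name\">second</field></doc></add>");

        // Send the requests in parallel over the reused connections.
        List<CompletableFuture<HttpResponse<String>>> inFlight = payloads.stream()
                .map(xml -> HttpRequest.newBuilder(UPDATE)
                        .header("Content-Type", "text/xml; charset=utf-8")
                        .POST(HttpRequest.BodyPublishers.ofString(xml))
                        .build())
                .map(req -> client.sendAsync(req, HttpResponse.BodyHandlers.ofString()))
                .collect(Collectors.toList());
        inFlight.forEach(CompletableFuture::join);

        // Changes only become visible to searchers after a commit.
        HttpRequest commit = HttpRequest.newBuilder(UPDATE)
                .header("Content-Type", "text/xml; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString("<commit/>"))
                .build();
        client.sendAsync(commit, HttpResponse.BodyHandlers.ofString()).join();
    }
}

Because the connections are kept alive, the per-request overhead stays
small, which is why this ends up close to sending several documents in
one POST.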

Re: One big XML file vs. many HTTP requests

Erik Hatcher
In reply to this post by Michael Levy-3

On May 12, 2006, at 1:02 PM, Michael Levy wrote:
> One nice feature of INQUERY is that you can create one large SGML
> file, containing lots of records, each bracketed with <DOC> and
> </DOC> tags.  Submitting that big SGML document for indexing goes
> very fast.
> I believe that Solr indexes one document at a time; each document
> requires a separate HTTP POST.

Actually, adding multiple documents per POST is possible.

> How efficient is making a separate HTTP request per-document, when  
> there are millions of documents?  Do people ever use Solr's or  
> Lucene's API directly for indexing large numbers of documents, and  
> if so, what are the considerations pro and con?

Maybe Solr could evolve a facility for doing these types of bulk  
operations without HTTP, but still using Solr's engine somehow via  
API directly.  I guess this gets tricky when you have a live Solr  
system up and juggling write locks though.

But currently going through HTTP is the only way, and it is not likely
to be much of a bottleneck, especially since you can post multiple
documents at a time (the wiki has an example, but I can't get to the
web at the moment to post the link).

        Erik
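
Since the wiki link isn't in the message, here is a hedged sketch of the
multi-document form Erik mentions: several <doc> elements inside a single
<add>, built with a small helper. The field names and values are
illustrative, not from the thread.

import java.util.List;
import java.util.Map;

public class MultiDocPayload {
    /** Builds a single <add> payload containing one <doc> per input document. */
    static String buildAddPayload(List<Map<String, String>> docs) {
        StringBuilder xml = new StringBuilder("<add>");
        for (Map<String, String> doc : docs) {
            xml.append("<doc>");
            for (Map.Entry<String, String> field : doc.entrySet()) {
                xml.append("<field name=\"").append(field.getKey()).append("\">")
                   .append(field.getValue())   // real code should XML-escape the value
                   .append("</field>");
            }
            xml.append("</doc>");
        }
        return xml.append("</add>").toString();
    }

    public static void main(String[] args) {
        String payload = buildAddPayload(List.of(
                Map.of("id", "1", "title", "first document"),
                Map.of("id", "2", "title", "second document")));
        System.out.println(payload);
        // The whole payload is POSTed to the update handler in one request.
    }
}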


Re: Java heap space

Marcus Stratmann
In reply to this post by Marcus Stratmann
On 5/4/06, I wrote:
> From my point of view it looks like this: Revision 393957 works while
> the latest revision causes problems. I don't know what part of the
> distribution causes the problems but I will try to find out. I think a
> good start would be to find out which was the first revision not working
> for me. Maybe this would be enough information for you to find out what
> had been changed at this point and what causes the problems.
(As a reminder, this was a problem with Jetty.)
Unfortunately I was not able to figure out what was going on. I
compiled some newer revisions from May, but my problem with deleting a
huge number of documents did not appear again. Maybe this is because I
changed the configuration a bit, adding "omitNorms=true" for some
fields.

Meanwhile I switched over to Tomcat 5.5 as the application server and
things seem to go fine now. The only situation in which I get OutOfMemory
errors is after an optimize, when the server performs an auto-warming
of the caches:
SEVERE: Error during auto-warming of key:org.apache.solr.search.QueryResultKey@2b14e8b7:java.lang.OutOfMemoryError: Java heap space
(from the Tomcat log)
But nevertheless the server seems to run stably now with nearly 11
million documents.

Thanks to all the friendly people helping me so far!
Marcus



Re: Java heap space

Yonik Seeley
On 5/15/06, Marcus Stratmann <[hidden email]> wrote:

> The only situation in which I get OutOfMemory
> errors is after an optimize, when the server performs an auto-warming
> of the caches:

A single filter that is big enough to be represented as a bitset
(more than roughly 3000 matching documents, in general) will take up 1.3MB.

Some ways to help memory:
  - increase the heap size ;-)
  - make sure you don't have autowarming for more than one searcher
happening at a time.  If this happens, you should see something in
your logs like "PERFORMANCE WARNING: Overlapping onDeckSearchers=2"
  - do omitNorms on every field you can... every field with norms will
take up 1 byte per document (11MB in your case)
  - make caches smaller if you can survive the performance hit... A
single filter that is represented as a BitSet will take up 1.3MB for
11M docs (and bigger in the case that maxDocs is larger than numDocs
because of deletions).


-Yonik
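
To make the sizes above concrete, here is the back-of-the-envelope
arithmetic for an index of roughly 11 million documents; it reproduces the
1.3MB-per-filter and roughly 11MB-per-field figures quoted above. The
maxDoc value is hypothetical.

public class HeapEstimates {
    public static void main(String[] args) {
        long numDocs = 11_000_000L;  // roughly the index size mentioned in the thread

        // A filter cached as a BitSet costs one bit per document in the index.
        double bitSetMB = numDocs / 8.0 / (1024 * 1024);
        System.out.printf("BitSet filter : %.2f MB per cached filter%n", bitSetMB);    // ~1.31 MB

        // Norms cost one byte per document for every field that keeps them, which is
        // why omitNorms=true helps on fields that don't need length/boost scoring.
        double normsMB = numDocs / (1024.0 * 1024.0);
        System.out.printf("Field norms   : %.2f MB per field with norms%n", normsMB);  // ~10.5 MB (11 MB in decimal units)

        // With deletions, filters are sized by maxDoc rather than numDocs, so they grow.
        long maxDoc = 12_000_000L;   // hypothetical value after deletions
        System.out.printf("BitSet @maxDoc: %.2f MB%n", maxDoc / 8.0 / (1024 * 1024));
    }
}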

Re: One big XML file vs. many HTTP requests

Marcus Stratmann
In reply to this post by Erik Hatcher
Erik Hatcher wrote:
>> I believe that Solr indexes one document at a time; each document  
>> requires a separate HTTP POST.
> Actually, adding multiple documents per POST is possible.
But deleting multiple documents with just one POST is not possible,
right? Is there a special reason for that or is it because nobody asked
for that yet? If so: I'd like to have it! ;-)

Thanks to Erik for the hint!

Marcus

Re: One big XML file vs. many HTTP requests

Chris Hostetter-3

: But deleting multiple documents with just one POST is not possible,
: right? Is there a special reason for that or is it because nobody asked

delete by query will remove multiple documents with a single command .. but
if you mean delete by id .. you may be right about it not having the same
"loop" kludge that <add> has.

As Yonik has mentioned before .. if you use persistent connections in your
HTTP Client layer, there isn't really any advantage to sending multiple
commands in one request, vs sending multiple requests.

-Hoss
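
For readers who haven't seen the update XML, here is a small sketch of the
two payload shapes being contrasted here; the id and query values are made
up for illustration.

public class DeleteCommands {
    public static void main(String[] args) {
        // Delete by id: per the discussion above, <delete> had no multi-id "loop" form
        // at the time, so removing N documents by id meant sending N such commands.
        String deleteById = "<delete><id>SP2514N</id></delete>";  // illustrative id value

        // Delete by query: one command removes every document matching the query.
        String deleteByQuery = "<delete><query>category:discontinued</query></delete>";

        System.out.println(deleteById);
        System.out.println(deleteByQuery);
        // Either payload is POSTed to the update handler, followed by <commit/>.
    }
}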


Re: One big XML file vs. many HTTP requests

Michael Levy-3
It seems you can do something like
<delete><query>FIELDNAME:a*</query></delete>
and
<delete><query>FIELDNAME:b*</query></delete>
...but you can't simply
<delete><query>FIELDNAME:*</query></delete>
or
 <delete><query>*</query></delete>

The demo post.sh returns <result status="400">Error parsing Lucene
query</result>
and the demo Solr Admin page shows
XML Parsing Error: syntax error
Location:
http://wiki.ushmm.org:8080/solr/select/?stylesheet=&q=*&version=2.1&start=0&rows=10&indent=on
Line Number 1, Column 1:org.apache.solr.core.SolrException: Error parsing
Lucene query
^

What is the best way to delete all records, for example if you want to clear
out the entire index and reindex everything?


On 5/21/06, Chris Hostetter <[hidden email]> wrote:

>
>
> : But deleting multiple documents with just one POST is not possible,
> : right? Is there a special reason for that or is it because nobody asked
>
> delete by query will remove multiple documents with a single command .. but
> if you mean delete by id .. you may be right about it not having the same
> "loop" kludge that <add> has.
>
> As Yonik has mentioned before .. if you use persistent connections in your
> HTTP Client layer, there isn't really any advantage to sending multiple
> commands in one request, vs sending multiple requests.
>
> -Hoss
>
>

Re: One big XML file vs. many HTTP requests

Chris Hostetter-3

: ...but you can't simply
: <delete><query>FIELDNAME:*</query></delete>
: or
:  <delete><query>*</query></delete>

That's because the Lucene query parser doesn't support 100% wildcard
queries.

: What is the best way to delete all records, for example if you want to clear
: out the entire index and reindex everything?

if you really want to make sure *EVERYTHING* is gone -- delete the index
directory and bounce the port, and Solr will make a new one.

if you have a uniqueKey field, or some other field you are sure every
document contains an indexed value for, then just do an unbounded range
query on that field...

         <delete><query>FIELDNAME:[* TO *]</query></delete>


-Hoss
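
A sketch of the second option, assuming the uniqueKey field is named "id"
(substitute whatever field every document in your schema is guaranteed to
have) and the same assumed local update URL as in the earlier sketch.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DeleteAll {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        URI update = URI.create("http://localhost:8983/solr/update");  // adjust to your install

        // Unbounded range query on a field every document has (here: an assumed uniqueKey "id").
        post(client, update, "<delete><query>id:[* TO *]</query></delete>");
        // The deletes only become visible to searchers after a commit.
        post(client, update, "<commit/>");
    }

    static void post(HttpClient client, URI uri, String body) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(uri)
                .header("Content-Type", "text/xml; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode() + " " + resp.body());
    }
}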


Range vs Term lookup

maustin
In reply to this post by Yonik Seeley
I'm doing a search based on price and was wondering what the performance
difference would be between these two queries:

1) +price:[0 TO 20]
2) +price:4567

Basically, should I do a search with a range, or pre-determine the range
and do a search based on an id?  It would be easier to set up with a range;
however, I don't want to lose performance.

Thanks,
Mike


Re: Range vs Term lookup

Yonik Seeley
On 5/30/06, maustin75 <[hidden email]> wrote:
> I'm doing a search based on price and was wondering what the performance
> difference would be between these two queries:
>
> 1) +price:[0 TO 20]
> 2) +price:4567
>
> Basically, should I do a search with a range, or pre-determine the range
> and do a search based on an id?

The same speed if they are in Solr's cache :-)
The range query will be slightly slower, but whether it becomes a
bottleneck or not depends on the total complexity of the queries/requests.

> It would be easier to set up with a range; however, I
> don't want to lose performance.

Start with the more flexible range query and only optimize if
necessary.  Both should be relatively quick.

-Yonik

Re: Range vs Term lookup

maustin
> The same speed if they are in Solr's cache :-)
> The range query will be slightly slower, but whether it becomes a
> bottleneck or not depends on the total complexity of the queries/requests.

What does the cache use as a key to determine whether a query is cached?

For example, how many bitsets are cached here with these two searches?

"+id:test1 +id:test2"
and
"+id:test2 +id:test1"


Does Solr parse and cache each query item, or the entire search query?

> Start with the more flexible range query and only optimize if
> necessary.  Both should be relatively quick.

Yeah.. I think that is what I will do.

Re: Range vs Term lookup

Yonik Seeley
On 5/30/06, maustin75 <[hidden email]> wrote:

> > The same speed if they are in Solr's cache :-)
> > The range query will be slightly slower, but whether it becomes a
> > bottleneck or not depends on the total complexity of the queries/requests.
>
> What does the cache use as a key to determine whether a query is cached?
>
> For example, how many bitsets are cached here with these two searches?
>
> "+id:test1 +id:test2"
> and
> "+id:test2 +id:test1"

The Query acts as the cache key.
There is currently no normalization done to make "a AND b" equivalent to "b AND a".
This often isn't much of a problem for programmatically generated
queries since they are generated the same way each time.

> Does Solr parse and cache each query item, or the entire search query?

The combination of query, filter, and sort acts as the cache key into the
queryResult cache.  The results of that query aren't fully cached,
though; only a subset is.

Here is the relevant portion of solrconfig.xml:
   <!-- An optimization for use with the queryResultCache.  When a search
         is requested, a superset of the requested number of document ids
         are collected.  For example, if a search for a particular query
         requests matching documents 10 through 19, and queryWindowSize is 50,
         then documents 0 through 49 will be collected and cached.  Any further
         requests in that range can be satisfied via the cache.  -->
    <queryResultWindowSize>10</queryResultWindowSize>


-Yonik
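
Because the Query is used as-is for the cache key, client code that
assembles boolean clauses can put them in a canonical order before building
the query string, so that logically identical searches map to a single
cache entry. A small illustrative sketch, not Solr code:

import java.util.List;
import java.util.stream.Collectors;

public class CanonicalQuery {
    /** Joins required clauses in sorted order so that {"+id:test1", "+id:test2"}
     *  and {"+id:test2", "+id:test1"} produce the same query string, and
     *  therefore the same cache key on the server side. */
    static String canonical(List<String> clauses) {
        return clauses.stream()
                .sorted()
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(canonical(List.of("+id:test1", "+id:test2")));  // +id:test1 +id:test2
        System.out.println(canonical(List.of("+id:test2", "+id:test1")));  // +id:test1 +id:test2
    }
}

Sorting is just one convenient canonical order; any deterministic ordering
of the clauses has the same effect.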