[All of latest code from SVN]

OK, I'm writing this email because I'm confused about some code in Lucene, which will almost certainly highlight a failure of mine to fully grasp something very basic.  I've had the Lucene in Action book for a while and have generally been playing around with Lucene on the side, but now we have a serious need for Lucene to index a lot of stuff.

I keep hearing people say Lucene "can index millions of docs in a handful of minutes", but I can't see this in action, even after fiddling with mergeFactor/maxMergeDocs etc. (I even tried RAMDirectory batching; it may be a bit quicker, but not that much quicker for me).
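
(To be clear, by "RAMDirectory batching" I mean roughly the following pattern: build each batch of documents in a RAMDirectory, then merge it into the on-disk index.  This is only a sketch -- the method shape, the 10000 batch size, and the fsWriter/docs parameter names are my own illustrative choices, not code I'm actually running:)

    // Sketch of the RAMDirectory batching pattern (illustrative only).
    // fsWriter is the on-disk IndexWriter; docs is the stream of documents.
    void batchIndex(IndexWriter fsWriter, Analyzer analyzer, Iterator docs)
            throws IOException {
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
        int buffered = 0;
        while (docs.hasNext()) {
            ramWriter.addDocument((Document) docs.next());
            if (++buffered >= 10000) {            // arbitrary batch size
                ramWriter.close();                // flush the in-memory batch
                fsWriter.addIndexes(new Directory[] { ramDir });
                ramDir = new RAMDirectory();      // start a fresh batch
                ramWriter = new IndexWriter(ramDir, analyzer, true);
                buffered = 0;
            }
        }
        ramWriter.close();                        // flush the final partial batch
        fsWriter.addIndexes(new Directory[] { ramDir });
    }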

So, I'm posting my code, my analysis, and the hprof trace from the run, in the hope that as many people as possible will look at my code, laugh, and point out my failings.  Please be gentle with me... :)

First, here's my indexing testbed code (please ignore the RAMWriter; I know it isn't used in this example, but I was trialling things at times):

   import java.io.File;
   import java.io.IOException;
   import java.sql.*;

   import org.apache.log4j.Logger;
   import org.apache.lucene.analysis.Analyzer;
   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.store.RAMDirectory;

   private static void createIndex(final Logger LOG, File indexDir) throws IOException, SQLException {
       indexDir.mkdirs();
       LOG.info("Index=" + indexDir.getAbsolutePath());
       Analyzer a = new StandardAnalyzer();
       IndexWriter writer = new IndexWriter(indexDir, a, true);
       Connection connection = DriverManager.getConnection(
               "jdbc:inetdae7:tower.aconex.com?database=mel20050407l", "*****", "*****"); // credentials masked
       String sql = "select userid, userfirstname, userlastname, email from userx";
       LOG.info("sql=" + sql);
       Statement statement = connection.createStatement();
       LOG.info("Executing sql");
       ResultSet rs = statement.executeQuery(sql);
       LOG.info("ResultSet retrieved");
       int row = 0;
//        final int MAX_ROWS = 100000;
//        SimpleDateFormat df = new SimpleDateFormat("yyyyMMdd");
       RAMDirectory ramDir = new RAMDirectory();
       IndexWriter ramWriter = new IndexWriter(ramDir, a, true);
//        ramWriter.setMaxBufferedDocs(10000);
//        ramWriter.setMergeFactor(1000);
       LOG.info("Indexing users");
       long begin = System.currentTimeMillis();
       while (rs.next()) {
           int userid = rs.getInt(1);
           String firstname = rs.getString(2);
           String lastname = rs.getString(3);
           String email = rs.getString(4);
           String fullName = firstname + " " + lastname;
           Document doc = new Document();
           doc.add(Field.Keyword("userid", userid + ""));
           doc.add(Field.Keyword("firstname", firstname.toLowerCase()));
           doc.add(Field.Keyword("lastname", lastname.toLowerCase()));
           doc.add(Field.Text("name", fullName.toLowerCase()));
           doc.add(Field.Keyword("email", email.toLowerCase()));
           writer.addDocument(doc);   // straight to the file-based index; the RAMWriter is unused here
           row++;
           if ((row % 100) == 0) {
               LOG.info(row + " indexed");
           }
       }
       long end = System.currentTimeMillis();
       double diff = (end - begin) / 1000.0;
       double rate = row / diff;
       LOG.info("rate:" + rate);
       LOG.info("Finished RAM index, creating file index, " + row + " users indexed");
       //writer.addIndexes(new Directory[]{ramDir});
       writer.close();   // flush and close the on-disk index
       LOG.info("Completed indexing");
   }

Even switching between compound and non-compound format doesn't really change the fact that both my Mac PowerBook and a dual 2.4 GHz Xeon box, each with at least a GB of RAM, are effectively CPU maxed out (obviously only one CPU on the dual Xeon is maxed out).
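
(For reference, the format switch and the merge knobs I was fiddling with are just these setters on the trunk API -- shown with what I believe are the defaults, so treat the values as approximate:)

    writer.setUseCompoundFile(false);            // compound (.cfs) format is the default
    writer.setMergeFactor(10);                   // default is 10
    writer.setMaxMergeDocs(Integer.MAX_VALUE);   // default: effectively unbounded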

The attached hprof logs, from runs using the "-Xrunhprof:cpu=samples,file=log.txt,depth=3" option, highlight that:

  1 48.94% 48.94%    1061   101 org.apache.lucene.index.IndexWriter.maybeMergeSegments

This reads to me like it's spending a LOT of time in this method.  Looking at the current code in SVN and debugging through it, I see that as each Document is added to the IndexWriter, the loop in maybeMergeSegments gets called, and it spends a busy time iterating over every SegmentInfo instance.  In my case there seems to be only one document in each SegmentInfo, which means that each time a new document is added, this loop gets bigger and bigger.  Perhaps someone can explain why each SegmentInfo contains only one doc during this time?
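
For anyone without the source handy, the loop looks roughly like this (paraphrased from the trunk-era IndexWriter; names and details are approximate, not a verbatim copy, and it references the writer's own fields):

    // paraphrase of IndexWriter.maybeMergeSegments(); runs after every addDocument()
    long targetMergeDocs = minMergeDocs;          // smallest merge unit (default 10)
    while (targetMergeDocs <= maxMergeDocs) {
        // scan backwards, collecting segments smaller than the current target
        int minSegment = segmentInfos.size();
        int mergeDocs = 0;
        while (--minSegment >= 0) {
            SegmentInfo si = segmentInfos.info(minSegment);
            if (si.docCount >= targetMergeDocs)
                break;
            mergeDocs += si.docCount;             // this scan is what hprof is catching
        }
        if (mergeDocs >= targetMergeDocs)
            mergeSegments(minSegment + 1);        // enough small segments: merge them
        else
            break;                                // nothing to do at this level
        targetMergeDocs *= mergeFactor;           // move up to the next tier
    }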

Now, I'm sure there is a good reason for this loop, but I'm just highlighting that, as far as I can see, it is the biggest CPU consumer in the indexing operation (again, it could just be my test setup).

See also log.log, which is the log4j output as each 100 items are added (ignore the "Finished RAM indexing..." log line; as you can see from the code above, the RAMWriter isn't used in this case).

Observations about what's inaccurate in my testing setup are very welcome; I just want to understand why I'm CPU bound.  At this rate I'm only getting on average 300-400-ish items/second added to the index.  Of course a faster CPU is always a good idea, but if there _is_ a way to optimize this area, one could see a significant indexing speed increase (moving the bottleneck back to I/O, I guess).


Paul Smith

Attachment: log.tar.gz (46K)
Re: [All of latest code from SVN]

Daniel Naber
On Friday 29 April 2005 04:15, Paul Smith wrote:

> At this rate I'm
> only getting on average 300-400-ish items/second added to the index.

I think that's realistic for typical uses of Lucene on common hardware.

> I keep hearing people say "can index millions of docs in a handful of
> minutes"

Not sure where you got this from, but this will only be true in very rare
special cases.




Re: [All of latest code from SVN]


>> At this rate I'm
>> only getting on average 300-400-ish items/second added to the index.
>
> I think that's realistic for typical uses of Lucene on common hardware.
Thanks Daniel, it's comforting to know that this is at least expected.  Can you or anyone else comment on the CPU profile I sent in?  If there were a way to optimize that loop, it could mean a reasonable improvement in indexing speed.
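
To illustrate the kind of bookkeeping I have in mind (a purely hypothetical sketch, not a patch against the real code -- the field name and the guard are my invention, and it glosses over segments that aren't single-doc):

    // hypothetical: track buffered docs instead of rescanning segmentInfos
    // on every single addDocument() call
    private int bufferedDocCount = 0;

    public void addDocument(Document doc) throws IOException {
        // ... analyze and write the single-doc segment exactly as now ...
        bufferedDocCount++;
        if (bufferedDocCount >= minMergeDocs) {   // O(1) test instead of an O(n) scan
            maybeMergeSegments();                 // scan only when a merge is plausible
            bufferedDocCount = 0;
        }
    }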


Paul Smith