[jira] Created: (NUTCH-473) ExcepExtractor performance bad due to String concatenation

ASF GitHub Bot (Jira)
ExcepExtractor performance bad due to String concatenation
----------------------------------------------------------

                 Key: NUTCH-473
                 URL: https://issues.apache.org/jira/browse/NUTCH-473
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
    Affects Versions: 0.9.0
         Environment: Tested under Windows, Java 1.5 and 1.6
            Reporter: Antony Bowesman


With the 0.9 version, ExcelExtractor was still running after 4 hours at 100% CPU while trying to extract the text from a 3MB Excel file containing 26 sheets, half with a matrix of approx 1100 rows x P columns and the others with approx 1000 rows x E columns.

After changing ExcelExtractor to use StringBuffer, the same extraction took 3 seconds under Java 1.5.  Code changes are below. The example allocates a 4K buffer per sheet; this was a completely arbitrary choice, but it keeps the number of StringBuffer expansions low for large files without using too much space for small files.
 

  protected String extractText(InputStream input) throws Exception {

    // "new" never returns null, so the old null check on wb was dead code.
    HSSFWorkbook wb = new HSSFWorkbook(input);

    HSSFSheet sheet;
    HSSFRow row;
    HSSFCell cell;
    int rNum = 0;
    int cNum = 0;

    int sNum = wb.getNumberOfSheets();

    //  Allow 4K per sheet - seems a reasonable start
    StringBuffer sb = new StringBuffer(4096 * sNum);
    for (int i=0; i<sNum; i++) {
      if ((sheet = wb.getSheetAt(i)) == null) {
        continue;
      }
      rNum = sheet.getLastRowNum();
      for (int j=0; j<=rNum; j++) {
        if ((row = sheet.getRow(j)) == null){
          continue;
        }
        cNum = row.getLastCellNum();

        for (int k=0; k<cNum; k++) {
          if ((cell = row.getCell((short) k)) != null) {
            // Date handling was already disabled in the 0.9 code:
            /*if(HSSFDateUtil.isCellDateFormatted(cell) == true) {
                resultText += cell.getDateCellValue().toString() + " ";
              } else
             */
            if (cell.getCellType() == HSSFCell.CELL_TYPE_STRING) {
              // was: resultText += cell.getStringCellValue() + " ";
              sb.append(cell.getStringCellValue());
              sb.append(' ');
            } else if (cell.getCellType() == HSSFCell.CELL_TYPE_NUMERIC) {
              // was: resultText += d.toString() + " ";
              // append(double) produces the same text as Double.toString()
              sb.append(cell.getNumericCellValue());
              sb.append(' ');
            }
            // Formula handling was already disabled in the 0.9 code:
            /* else if(cell.getCellType() == HSSFCell.CELL_TYPE_FORMULA){
                 resultText += cell.getCellFormula() + " ";
               }
             */
          }
        }
      }
    }
    return sb.toString();
  }
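
To illustrate why the change helps: each += on a String builds a new String and copies everything accumulated so far, so the total copying grows roughly with the square of the output length, whereas StringBuffer appends into one growing array. The following is only a minimal standalone sketch, not part of the Nutch code. The class ConcatCost, its method names, and the synthetic cell values are made up for illustration, and the timings it prints will vary by machine and JVM.

```java
class ConcatCost {

    // Mirrors the 0.9 approach: every += builds a new String from the old one.
    static String withConcat(String[] cells) {
        String result = "";
        for (String c : cells) {
            result += c + " ";
        }
        return result;
    }

    // Mirrors the proposed change: append into a single presized buffer.
    static String withBuffer(String[] cells) {
        StringBuffer sb = new StringBuffer(4096);
        for (String c : cells) {
            sb.append(c);
            sb.append(' ');
        }
        return sb.toString();
    }

    // Counts the characters written by the += loop: each step writes the
    // whole accumulated result, including the piece just added.
    static long charsCopiedByConcat(String[] cells) {
        long copied = 0;
        long length = 0;
        for (String c : cells) {
            length += c.length() + 1;   // cell text plus one space
            copied += length;
        }
        return copied;
    }

    public static void main(String[] args) {
        String[] cells = new String[5000];   // synthetic data, not a real sheet
        for (int i = 0; i < cells.length; i++) {
            cells[i] = "cell" + i;
        }

        long t0 = System.nanoTime();
        String a = withConcat(cells);
        long t1 = System.nanoTime();
        String b = withBuffer(cells);
        long t2 = System.nanoTime();

        System.out.println("same output:        " + a.equals(b));
        System.out.println("output length:      " + b.length());
        System.out.println("chars copied by +=: " + charsCopiedByConcat(cells));
        System.out.println("+= took             " + (t1 - t0) / 1000 + " us");
        System.out.println("StringBuffer took   " + (t2 - t1) / 1000 + " us");
    }
}
```

Both methods return the same text; only the amount of copying differs. The copied-character count grows much faster than the output length as the number of cells rises, which is consistent with the large sheets described above being hit hardest.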
 


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-473) ExcelExtractor performance bad due to String concatenation

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antony Bowesman updated NUTCH-473:
----------------------------------

    Summary: ExcelExtractor performance bad due to String concatenation  (was: ExcepExtractor performance bad due to String concatenation)


[jira] Resolved: (NUTCH-473) ExcelExtractor performance bad due to String concatenation

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren resolved NUTCH-473.
------------------------------

    Resolution: Duplicate

Duplicate of NUTCH-456.
