[jira] Created: (SOLR-2186) DataImportHandler multi-threaded option throws exception


[jira] Created: (SOLR-2186) DataImportHandler multi-threaded option throws exception

Jason Grey (Jira)
DataImportHandler multi-threaded option throws exception
--------------------------------------------------------

                 Key: SOLR-2186
                 URL: https://issues.apache.org/jira/browse/SOLR-2186
             Project: Solr
          Issue Type: Bug
          Components: contrib - DataImportHandler
            Reporter: Lance Norskog


The multi-threaded option for the DataImportHandler throws an exception and the entire operation fails. This is true even if only 1 thread is configured via *threads='1'*


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Commented: (SOLR-2186) DataImportHandler multi-threaded option throws exception

Jason Grey (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923764#action_12923764 ]

Lance Norskog commented on SOLR-2186:
-------------------------------------

This is the stack trace. The operation configures 4 threads and then does a full-import:

Oct 21, 2010 10:21:16 PM org.apache.solr.handler.dataimport.DocBuilder doFullDump
INFO: running multithreaded full-import
Oct 21, 2010 10:21:16 PM org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper nextRow
INFO: arow : {fileSize=18837, fileLastModified=Wed Nov 21 08:15:23 PST 2007, fileAbsolutePath=/lucid/private_pdfs/10.pdfs/10.1.1.10.1.pdf, fileDir=/lucid/private_pdfs/10.pdfs, file=10.1.1.10.1.pdf}
Oct 21, 2010 10:21:16 PM org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper nextRow
INFO: arow : {fileSize=289898, fileLastModified=Wed Nov 21 08:15:25 PST 2007, fileAbsolutePath=/lucid/private_pdfs/10.pdfs/10.1.1.10.10.pdf, fileDir=/lucid/private_pdfs/10.pdfs, file=10.1.1.10.10.pdf}
Oct 21, 2010 10:21:16 PM org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper nextRow
INFO: arow : {fileSize=121847, fileLastModified=Wed Nov 21 08:15:43 PST 2007, fileAbsolutePath=/lucid/private_pdfs/10.pdfs/10.1.1.10.100.pdf, fileDir=/lucid/private_pdfs/10.pdfs, file=10.1.1.10.100.pdf}
Oct 21, 2010 10:21:16 PM org.apache.solr.handler.dataimport.ThreadedEntityProcessorWrapper nextRow
INFO: arow : {fileSize=59844, fileLastModified=Wed Nov 21 08:18:49 PST 2007, fileAbsolutePath=/lucid/private_pdfs/10.pdfs/10.1.1.10.1000.pdf, fileDir=/lucid/private_pdfs/10.pdfs, file=10.1.1.10.1000.pdf}
Oct 21, 2010 10:21:16 PM org.apache.solr.handler.dataimport.DocBuilder doFullDump
SEVERE: error in import
java.lang.NullPointerException
 at org.apache.solr.handler.dataimport.ContextImpl.getResolvedEntityAttribute(ContextImpl.java:79)
 at org.apache.solr.handler.dataimport.ThreadedContext.getResolvedEntityAttribute(ThreadedContext.java:78)
 at org.apache.solr.handler.dataimport.TikaEntityProcessor.firstInit(TikaEntityProcessor.java:67)
 at org.apache.solr.handler.dataimport.EntityProcessorBase.init(EntityProcessorBase.java:56)
 at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.initEntity(DocBuilder.java:507)
 at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:425)
 at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.run(DocBuilder.java:386)
 at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.runAThread(DocBuilder.java:453)
 at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner.access$000(DocBuilder.java:340)
 at org.apache.solr.handler.dataimport.DocBuilder$EntityRunner$1.run(DocBuilder.java:393)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:619)
Oct 21, 2010 10:21:16 PM org.apache.solr.handler.dataimport.DocBuilder finish
INFO: Import completed successfully
Oct 21, 2010 10:21:16 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)



[jira] Commented: (SOLR-2186) DataImportHandler multi-threaded option throws exception

Jason Grey (Jira)
In reply to this post by Jason Grey (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923765#action_12923765 ]

Lance Norskog commented on SOLR-2186:
-------------------------------------

This is the dataConfig.xml. It is very simple: it walks a directory and indexes every PDF file it finds.
If you change threads='4' to threads='1', it will still fail. If you remove the threads directive, it runs.

<dataConfig>
   <dataSource type="BinFileDataSource"/>
   <document>
     <entity name="jc" dataSource="null"
             pk="id"
             processor="FileListEntityProcessor"
             fileName="^.*\.pdf$" recursive="false"
             baseDir="/lucid/private_pdfs/10.pdfs"
             transformer="TemplateTransformer"
             threads='4'
             >

        <field column="id" template="${jc.fileAbsolutePath}"/>

        <entity name="tika-test" processor="TikaEntityProcessor"
                url="${jc.fileAbsolutePath}"
                parser="org.apache.tika.parser.pdf.PDFParser"
                onError="skip"
                >
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="text"/>
        </entity>
      </entity>
    </document>
</dataConfig>



[jira] Issue Comment Edited: (SOLR-2186) DataImportHandler multi-threaded option throws exception

Jason Grey (Jira)
In reply to this post by Jason Grey (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923765#action_12923765 ]

Lance Norskog edited comment on SOLR-2186 at 10/23/10 8:47 PM:
---------------------------------------------------------------

This is the dataConfig.xml. It is very simple: it walks a directory and indexes every PDF file it finds.
If you change threads='4' to threads='1', it will still fail. If you remove the threads directive, it runs.

{noformat}
<dataConfig>
   <dataSource type="BinFileDataSource"/>
   <document>
     <entity name="jc" dataSource="null"
             pk="id"
             processor="FileListEntityProcessor"
             fileName="^.*\.pdf$" recursive="false"
             baseDir="/lucid/private_pdfs/10.pdfs"
             transformer="TemplateTransformer"
             threads='4'
             >

        <field column="id" template="${jc.fileAbsolutePath}"/>

        <entity name="tika-test" processor="TikaEntityProcessor"
                url="${jc.fileAbsolutePath}"
                parser="org.apache.tika.parser.pdf.PDFParser"
                onError="skip"
                >
                <field column="Author" name="author" meta="true"/>
                <field column="title" name="title" meta="true"/>
                <field column="text" name="text"/>
        </entity>
      </entity>
    </document>
</dataConfig>
{noformat}


[jira] Commented: (SOLR-2186) DataImportHandler multi-threaded option throws exception

Jason Grey (Jira)
In reply to this post by Jason Grey (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924284#action_12924284 ]

Lance Norskog commented on SOLR-2186:
-------------------------------------

I've tracked it down. The ThreadedContext object is built without a resolver. There is a comment saying that the resolver will be set dynamically, but it is not.

The ThreadedContext resolver is called in the "firstInit" methods of TikaEntityProcessor, LineEntityProcessor, and XPathEntityProcessor. TikaEntityProcessor also calls it in nextRow.

public class ThreadedContext extends ContextImpl{
  private DocBuilder.EntityRunner entityRunner;
  private boolean limitedContext = false;

  public ThreadedContext(DocBuilder.EntityRunner entityRunner, DocBuilder docBuilder) {
    super(entityRunner.entity,
            null, // resolver: to be fetched at run time
            null,
            null,
            docBuilder.session,
            null,
            docBuilder);
    this.entityRunner = entityRunner;
  }

I hacked DocBuilder.java to pass in a resolver, and that allowed TikaEP to get through firstInit. The entity attribute resolver then failed in the nextRow method.

TikaEP is the only class that calls the entity attribute resolver outside of the firstInit() call. Is it possible to change TikaEP to use the resolver only in firstInit()?
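
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not Solr's actual code: the classes Resolver and Context are hypothetical, simplified stand-ins for the resolver and ContextImpl, and the path /tmp/example.pdf is made up. It only assumes, as the stack trace suggests, that getResolvedEntityAttribute dereferences the resolver without a null check.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins for the DIH classes; this is not Solr's code.
final class ResolverNpeSketch {

    /** Stand-in for a variable resolver: expands ${name} placeholders. */
    static final class Resolver {
        private final Map<String, String> vars = new HashMap<>();

        void put(String name, String value) {
            vars.put(name, value);
        }

        String replaceTokens(String template) {
            String out = template;
            for (Map.Entry<String, String> e : vars.entrySet()) {
                out = out.replace("${" + e.getKey() + "}", e.getValue());
            }
            return out;
        }
    }

    /** Stand-in for a context that resolves entity attributes through its resolver. */
    static final class Context {
        private final Map<String, String> entityAttrs;
        private final Resolver resolver; // null models the threaded case

        Context(Map<String, String> entityAttrs, Resolver resolver) {
            this.entityAttrs = entityAttrs;
            this.resolver = resolver;
        }

        String getResolvedEntityAttribute(String name) {
            // The resolver is dereferenced unconditionally, so a null resolver fails here.
            return resolver.replaceTokens(entityAttrs.get(name));
        }
    }

    /** Returns true if resolving the attribute throws a NullPointerException. */
    static boolean failsWithNpe(Context ctx, String attr) {
        try {
            ctx.getResolvedEntityAttribute(attr);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put("url", "${jc.fileAbsolutePath}");

        Resolver resolver = new Resolver();
        resolver.put("jc.fileAbsolutePath", "/tmp/example.pdf");

        // With a resolver the attribute is expanded; without one the call throws.
        System.out.println(new Context(attrs, resolver).getResolvedEntityAttribute("url"));
        System.out.println(failsWithNpe(new Context(attrs, null), "url"));
    }
}
```

In this sketch the number of threads is irrelevant: the context is simply constructed without a resolver, which would be consistent with the failure also appearing with threads='1'.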



[jira] Updated: (SOLR-2186) DataImportHandler multi-threaded option throws exception

Jason Grey (Jira)
In reply to this post by Jason Grey (Jira)

     [ https://issues.apache.org/jira/browse/SOLR-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lance Norskog updated SOLR-2186:
--------------------------------

    Attachment: TikaResolver.patch

This patch file fixes up the DataImportHandler so that the TikaEntityProcessor works under threads.

The technique is to pass in a resolver when creating a ThreadedContext (wrapper). This allows TikaEP.firstInit() to work. However, TikaEP.nextRow() is called with a context that has no functioning resolver, so TikaEP caches the resolver given in firstInit() and uses it during nextRow() instead of the one it should use.

This is not intended as a fix; it merely demonstrates the problem.

The patch is made with 'git diff' and I still haven't mastered it; some 'patch' programs may not like it.
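
The caching technique can be sketched roughly as follows. This is not the contents of TikaResolver.patch: Resolver, Context, and Processor are hypothetical stand-ins, and the sketch assumes that the resolver handed to firstInit() remains valid and sees per-row variables for later rows.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins illustrating the workaround; this is not the actual patch.
final class CachedResolverSketch {

    /** Stand-in for a variable resolver: expands ${name} placeholders. */
    static final class Resolver {
        private final Map<String, String> vars = new HashMap<>();

        void put(String name, String value) {
            vars.put(name, value);
        }

        String replaceTokens(String template) {
            String out = template;
            for (Map.Entry<String, String> e : vars.entrySet()) {
                out = out.replace("${" + e.getKey() + "}", e.getValue());
            }
            return out;
        }
    }

    /** Stand-in for a context; the resolver may be null for per-row contexts. */
    static final class Context {
        final Map<String, String> attrs;
        final Resolver resolver;

        Context(Map<String, String> attrs, Resolver resolver) {
            this.attrs = attrs;
            this.resolver = resolver;
        }
    }

    /** Stand-in for an entity processor that remembers the first resolver it sees. */
    static final class Processor {
        private Resolver cached;

        void firstInit(Context ctx) {
            // Keep the resolver that was available at initialization time.
            cached = ctx.resolver;
        }

        String nextRow(Context rowCtx) {
            // Use the cached resolver rather than rowCtx.resolver, which may be null.
            return cached.replaceTokens(rowCtx.attrs.get("url"));
        }
    }

    public static void main(String[] args) {
        Resolver resolver = new Resolver();
        resolver.put("jc.fileAbsolutePath", "/tmp/a.pdf");

        Map<String, String> attrs = new HashMap<>();
        attrs.put("url", "${jc.fileAbsolutePath}");

        Processor processor = new Processor();
        processor.firstInit(new Context(attrs, resolver));

        // The per-row context carries no resolver, yet the row still resolves.
        System.out.println(processor.nextRow(new Context(attrs, null)));
    }
}
```

The sketch only resolves later rows correctly because the cached resolver is the same object that receives the per-row variables. If each row or thread had its own resolver, the cached one could return stale values, which may be one reason the patch is offered as a demonstration only.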






[jira] Issue Comment Edited: (SOLR-2186) DataImportHandler multi-threaded option throws exception

Jason Grey (Jira)
In reply to this post by Jason Grey (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924838#action_12924838 ]

Lance Norskog edited comment on SOLR-2186 at 10/26/10 1:20 AM:
---------------------------------------------------------------

This patch file fixes up the DataImportHandler so that the TikaEntityProcessor works under threads.

The technique is to pass in a resolver when creating a ThreadedContext (wrapper). This allows TikaEP.firstInit() to work. However, TikaEP.nextRow() is called with a context that has no functioning resolver, so TikaEP caches the resolver given in firstInit() and uses it during nextRow() instead of the one it should use. Even so, the parsed text is spewed to the logger in addition to being indexed.

This is not intended as a fix; it merely demonstrates the problem.

The patch is made with 'git diff' and I still haven't mastered it; some 'patch' programs may not like it.




