State of DIH, concurrency

aanno.trash
Hello,

I looked a bit into the code of DIH (Solr dataimporthandler and
dataimporthandler-extra) and wonder what the state of this code is. It
lives in a 'contrib' folder and seems to work and to be maintained. But is
there ongoing development (e.g. additional features)?

The reason I'm asking is that I'm in a project where DIH is used.
However, the import is very slow, especially into a Solr cluster. I
glanced over the code for my case and it looks like DIH is
single-threaded only. I guess that changing DIH to support multi-threading
on the 'root' (top-level) entity should result in a dramatic performance boost.

Hence I hacked DIH a bit. To get started, I concentrated on the 'tika'
example case with a bunch of private PDFs and only for a 'full-import'.
From this (dirty) experiment, a multi-threaded DIH seems to be possible.
However, some bigger code changes are needed. This is an incomplete list:

* Make VariableResolver immutable and change its interface/contract.
* All EntityProcessors seem to be written with only a single thread in
mind. I circumvented the problem by (a) supporting a clone operation and
(b) cloning the EntityProcessors for each thread.
* To make the code easier to handle, I introduced several interfaces where
only abstract classes existed before (Context, DataSource,
DIHProperties, EntityProcessor, ...). Perhaps this is not strictly
needed, but it simplified the refactoring substantially.
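
The first two points can be illustrated with a minimal Java sketch. This is not the code from the experiment and not the real DIH API: RowProcessor, ImmutableVariables, CountingProcessor and ParallelRootImport are hypothetical stand-ins, and the assumption is that the rows of the root entity are already available as a list. It only shows the pattern: shared variables are immutable, and every worker thread gets its own copy of a processor that is not thread-safe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical, simplified stand-in for an entity processor (not the Solr class). */
interface RowProcessor {
    /** Returns an independent copy that shares no mutable state with this instance. */
    RowProcessor copy();

    /** Turns one root-entity row into a document; may mutate internal state. */
    Map<String, Object> process(String row, ImmutableVariables vars);
}

/** Hypothetical immutable variable holder: "changing" it yields a new instance. */
final class ImmutableVariables {
    private final Map<String, String> values;

    ImmutableVariables(Map<String, String> values) {
        this.values = Map.copyOf(values); // defensive, unmodifiable copy
    }

    ImmutableVariables with(String key, String value) {
        Map<String, String> next = new HashMap<>(values);
        next.put(key, value);
        return new ImmutableVariables(next);
    }

    String get(String key) {
        return values.get(key);
    }
}

/** Deliberately not thread-safe: it keeps a per-instance counter. */
final class CountingProcessor implements RowProcessor {
    private int seen = 0; // safe only because each thread owns its own copy

    public RowProcessor copy() {
        return new CountingProcessor();
    }

    public Map<String, Object> process(String row, ImmutableVariables vars) {
        seen++;
        return Map.of("id", vars.get("prefix") + row, "seq", seen);
    }
}

final class ParallelRootImport {
    /** Gives each worker every n-th root row and its own copy of the processor. */
    static List<Map<String, Object>> run(List<String> rows, RowProcessor prototype,
                                         ImmutableVariables vars, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<Map<String, Object>>>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int slice = t;
                final RowProcessor own = prototype.copy(); // one clone per worker
                futures.add(pool.submit(() -> {
                    List<Map<String, Object>> docs = new ArrayList<>();
                    for (int i = slice; i < rows.size(); i += threads) {
                        docs.add(own.process(rows.get(i), vars));
                    }
                    return docs;
                }));
            }
            // Collect the results in worker order.
            List<Map<String, Object>> all = new ArrayList<>();
            for (Future<List<Map<String, Object>>> f : futures) {
                try {
                    all.addAll(f.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("import interrupted", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("import worker failed", e);
                }
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real DIH the nested entities, transformers and data sources would need the same treatment, which is presumably where the bigger code changes come from.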

So this is my question: Would you consider the contribution of a BIG DIH
change for merging into the project? Or is DIH just dead and should go
away soon? And if you would consider the contribution, would it be best
as several small changes or as one 'big-bang' pull request? Would you
consider the contribution even if some features of DIH are dropped?
(From my experiment, a very hot candidate to drop is the
XPathEntityProcessor.)

Kind regards,

aanno2




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: State of DIH, concurrency

Ishan Chattopadhyaya

On Tue, 7 Jan, 2020, 4:20 PM aanno.trash, <[hidden email]> wrote:

Re: State of DIH, concurrency

Jörn Franke
In reply to this post by aanno.trash
I think it is deprecated in Solr 8.x and will disappear.

You can use Apache ManifoldCF or custom software to introduce parallelism.

> On 07.01.2020 at 11:50, aanno.trash <[hidden email]> wrote:

Re: State of DIH, concurrency

Mikhail Khludnev-2
In reply to this post by aanno.trash
Hello, aanno2.

Don't start it. Threads were fixed to a certain extent as of 3.6.1 under https://issues.apache.org/jira/browse/SOLR-3360
But right after that, threads were dropped from DIH for overall sanity under https://issues.apache.org/jira/browse/SOLR-3262
If you really need a certain level of concurrency, declare multiple DataImportHandlers in solrconfig.xml and submit multiple subrequests in parallel, each sharded with an explicit filter.
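
A rough sketch of that workaround in Java, under several assumptions: the handlers are registered in solrconfig.xml under the illustrative names /dataimport0, /dataimport1, ...; the request parameters slice and slices are made-up names; and the query in data-config.xml filters on them through DIH's request variables, for example WHERE MOD(id, ${dataimporter.request.slices}) = ${dataimporter.request.slice}. The core URL is a placeholder as well.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class SlicedDihImport {
    /** Builds one full-import URL per slice; handler and parameter names are illustrative. */
    static List<URI> sliceUris(String coreUrl, int slices) {
        List<URI> uris = new ArrayList<>();
        for (int i = 0; i < slices; i++) {
            // clean=false: a full-import cleans the index by default, so parallel
            // slices would otherwise delete each other's documents.
            uris.add(URI.create(coreUrl + "/dataimport" + i
                    + "?command=full-import&clean=false"
                    + "&slice=" + i + "&slices=" + slices));
        }
        return uris;
    }

    /** Sends all requests concurrently and returns the HTTP status codes in slice order. */
    static List<Integer> submitAll(List<URI> uris) {
        HttpClient client = HttpClient.newHttpClient();
        List<CompletableFuture<HttpResponse<String>>> pending = new ArrayList<>();
        for (URI uri : uris) {
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            pending.add(client.sendAsync(request, HttpResponse.BodyHandlers.ofString()));
        }
        List<Integer> statuses = new ArrayList<>();
        for (CompletableFuture<HttpResponse<String>> f : pending) {
            statuses.add(f.join().statusCode()); // waits for the HTTP response only
        }
        return statuses;
    }

    public static void main(String[] args) {
        // Placeholder core URL; prints the URLs without contacting a server.
        for (URI uri : sliceUris("http://localhost:8983/solr/mycore", 4)) {
            System.out.println(uri);
        }
    }
}
```

Note that an import request normally returns before the import itself has finished, so a real driver would still have to poll each handler with command=status.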

Good luck. You'd be better off trying a full-fledged ETL tool rather than band-aiding DIH.


On Tue, Jan 7, 2020 at 1:50 PM aanno.trash <[hidden email]> wrote:



--
Sincerely yours
Mikhail Khludnev