DIH: Create Child Documents in ScriptTransformer

Jörn Franke
Hi,

I load a set of documents. Logic is then applied to split each document
into chapters (this part is done). The whole document is loaded as a
parent, and its chapters, plus metadata, should be loaded as child
documents of that parent.
I would like to collect information on how this can be done:
* Use a custom loader - this is possible and works.
* Use DIH, extract the chapters in a ScriptTransformer, and add them as
child documents there. However, the ScriptTransformer receives only a
HashMap as input. Transforming field values works, but it does not seem
possible to add child documents within the DIH ScriptTransformer. I
tried adding a Java array of SolrInputDocuments, but this does not seem to
work. In debug/verbose mode I can see that the transformer does add them to
the HashMap correctly, but they don't end up in the document. Maybe this
is possible somehow via nested entities?
* Use DIH + an UpdateProcessor (script): there I get the SolrInputDocument
as a parameter, and it seems feasible to extract chapters and add them as
child documents.

Thank you.

Best regards

Re: DIH: Create Child Documents in ScriptTransformer

Erick Erickson
When it starts getting complex, I usually move to SolrJ. You say
you're loading documents, so I assume Tika is in the mix too.

Here's a blog on the topic so you can see how to get started...

https://lucidworks.com/post/indexing-with-solrj/

Best,
Erick

In reply to the post by Jörn Franke of Wed, Sep 18, 2019 at 2:56 PM.

Re: DIH: Create Child Documents in ScriptTransformer

Jörn Franke
I fully agree. However, I am just curious to see the limits.

In reply to the post by Erick Erickson of Sep 18, 2019 at 23:33.

Re: DIH: Create Child Documents in ScriptTransformer

Mikhail Khludnev
In reply to this post by Jörn Franke
Hello, Jörn.
Have you tried looking for the parent doc in the context, which is passed
as the second argument to the ScriptTransformer function?

--
Sincerely yours
Mikhail Khludnev

Re: DIH: Create Child Documents in ScriptTransformer

Jörn Franke
Hi,

Thanks for all the feedback.
The context parameter in the ScriptTransformer is new to me - thanks for
this insight. I could not find it in any docs. So, for others who did not
know about it either:
a ScriptTransformer function can take two parameters, e.g.
function mytransformer(row,context){
....
}
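
For illustration, here is a minimal sketch of what such a two-argument function might look like. It assumes the row is a java.util.Map (so put/get apply) and uses getEntityAttribute and isRootEntity from the DIH Context class. The field names source_entity_s and is_root_b are hypothetical and only show that the context is reachable from the script.

```javascript
// Sketch of a DIH ScriptTransformer function that uses both arguments.
// row:     java.util.Map of field name -> value for the current row
// context: org.apache.solr.handler.dataimport.Context
function mytransformer(row, context) {
  // Name of the <entity> currently being processed in the DIH config
  var entityName = context.getEntityAttribute('name');

  // Hypothetical fields, added only to demonstrate access to the context
  row.put('source_entity_s', entityName);
  row.put('is_root_b', context.isRootEntity());

  // DIH expects the (possibly modified) row to be returned
  return row;
}
```

In the DIH configuration, the entity would reference the function with transformer="script:mytransformer".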

The following Javadoc gives some hints on what you can do with the context:
https://lucene.apache.org/solr/8_2_0/solr-dataimporthandler/org/apache/solr/handler/dataimport/Context.html

Despite all this, I came to the conclusion that adding child docs in a
DIH ScriptTransformer is not supported.

One can, however, use a StatelessScriptUpdateProcessorFactory, see
https://lucene.apache.org/solr/8_2_0//solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html

and

https://cwiki.apache.org/confluence/display/solr/ScriptUpdateProcessor#ScriptUpdateProcessor-JavaScript

A hint on how to add child documents to a SolrInputDocument:
http://lucene.apache.org/solr/8_2_0/solr-solrj/index.html?org/apache/solr/common/SolrInputDocument.html
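
A rough sketch of how the update-processor route could look as an update script, not a tested configuration. The assumptions are loud ones: the parent carries "id" and "content" fields, chapters are separated by a made-up marker line "#CHAPTER", and the child field names doc_type_s and chapter_txt are hypothetical. Java.type is specific to the Nashorn engine. The calls getFieldValue, setField and addChildDocument are SolrInputDocument methods.

```javascript
// Sketch of an update script for StatelessScriptUpdateProcessorFactory.
// ASSUMPTIONS (hypothetical, not from the thread): fields "id" and "content",
// the chapter marker "\n#CHAPTER\n", and the child fields "doc_type_s" and
// "chapter_txt".

// Attach one child document per chapter to the parent document.
// newDoc is a factory that returns an empty SolrInputDocument.
function addChapters(doc, chapters, newDoc) {
  var parentId = String(doc.getFieldValue('id'));
  for (var i = 0; i < chapters.length; i++) {
    var child = newDoc();
    child.setField('id', parentId + '_ch' + (i + 1)); // unique id per child
    child.setField('doc_type_s', 'chapter');
    child.setField('chapter_txt', chapters[i]);
    doc.addChildDocument(child);
  }
  return chapters.length;
}

// Called by Solr for every document that is added.
function processAdd(cmd) {
  var doc = cmd.solrDoc; // org.apache.solr.common.SolrInputDocument
  var content = doc.getFieldValue('content');
  if (content == null) return; // nothing to split
  var chapters = String(content).split('\n#CHAPTER\n');
  // Nashorn-specific way to reach the Java class from the script
  var SolrInputDocument = Java.type('org.apache.solr.common.SolrInputDocument');
  addChapters(doc, chapters, function () { return new SolrInputDocument(); });
}

// The sample update-script.js shipped with Solr also defines these hooks;
// empty versions are included so that every hook exists.
function processDelete(cmd) {}
function processMergeIndexes(cmd) {}
function processCommit(cmd) {}
function processRollback(cmd) {}
function finish() {}
```

The splitting is kept out of addChapters so that the real chapter logic can replace the placeholder split. For DIH imports to pass through the script, the update chain containing the processor would presumably have to be the one the DIH request uses (for example via the update.chain parameter).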


Nevertheless, I agree that one should use an external tool, although
depending on the needs this can also add some complexity (e.g. supporting
individual transformations per collection through configuration/plugins
rather than code). While this is not a problem, it might be good
to start an open source loader that goes beyond the post tool (
https://lucene.apache.org/solr/guide/8_1/post-tool.html).

Best regards

In reply to the post by Mikhail Khludnev of Thu, Sep 19, 2019 at 8:54 AM.