Ingesting/Querying Documents with Nested/Related Documents and extracting Full-text

Ingesting/Querying Documents with Nested/Related Documents and extracting Full-text

Stephon Harris
Hi,

I want to ingest a collection of documents along with full text extracted
from PDFs, using Solr's 'update/extract' endpoint to store the text in a
field called "fullText". I also want to relate some documents to others, so
that when I query the "fullText" field with user terms, Solr returns the
first matching document whose "contentType" field equals "overview", along
with several related documents that have other "contentType" values, like
this:


[
{
    "id":"1",
    "contentType":"overview",
    "fullText":"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam consectetur ipsum libero, at egestas ante laoreet nec. Aliquam sem elit, rhoncus efficitur laoreet sodales, hendrerit eget mi. Nulla facilisis tincidunt tortor vel placerat. Phasellus blandit velit eget semper tristique. Maecenas convallis orci purus, ac scelerisque erat pulvinar id. Donec semper enim id justo cursus, vitae bibendum magna interdum. Maecenas eu laoreet nibh. Quisque magna massa, semper et lorem sed, volutpat pulvinar quam. Quisque a urna et risus feugiat fermentum nec et orci. Pellentesque ac neque sed tortor convallis finibus sit amet id purus. Sed blandit eget ante et semper. Vivamus.",
    "product":"paper & goods"
},
{
    "id":"2",
    "contentType":"support",
    "title":"The latest support boards",
    "points":["Nulla facilisis tincidunt tortor vel placerat."," Phasellus blandit velit eget semper tristique."],
    "product":"paper & goods",
    "parentID":"1"
},
{
    "id":"3",
    "contentType":"boards",
    "title":"",
    "points":["Nulla facilisis tincidunt tortor vel placerat."," Phasellus blandit velit eget semper tristique."],
    "product":"paper & goods",
    "parentID":"1"
}
]


I'm looking for recommendations on ingesting and querying these
documents. Can I ingest them by nesting the child documents inside the
overview document while also extracting full text from a PDF? If so, how do
I query for both the parent and its child documents?

Or should I skip nesting and instead match the overview's "id" field
against a "parentID" field in each related document? If so, how do I form a
query that matches documents whose "parentID" value equals another
document's "id" value?

--
Stephon Harris

*Enterprise Knowledge, LLC*
*Web: *http://www.enterprise-knowledge.com/
*E-mail:* [hidden email]
*Cell:* 832-628-8352
Fwd: Ingesting/Querying Documents with Nested/Related Documents and extracting Full-text

Stephon Harris
Following up on this to see if anyone has thoughts.

---------- Forwarded message ---------
From: Stephon Harris <[hidden email]>
Date: Wed, Nov 7, 2018 at 12:21 PM
Subject: Ingesting/Querying Documents with Nested/Related Documents and
extracting Full-text
To: <[hidden email]>


Re: Ingesting/Querying Documents with Nested/Related Documents and extracting Full-text

Alexandre Rafalovitch
The extract handler is mostly there for prototyping. It uses Tika under
the covers, and you can run Tika yourself in the client instead. Given
your merge requirements, it would probably be best to keep extraction
separate from indexing.
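
A rough sketch of what client-side extraction could look like in Python. The
third-party tika-python package, the collection name "docs", and the helper
names below are assumptions for illustration, not something this thread
prescribes. The idea is to extract the text yourself, merge it into the
overview document, and send the result to the regular /update handler.

```python
import json
import urllib.request


def extract_text(pdf_path):
    """Extract plain text from a PDF using the third-party tika-python package."""
    # Imported lazily so the helpers below work even without tika installed.
    from tika import parser

    parsed = parser.from_file(pdf_path)
    # "content" can be None when Tika finds no text in the file.
    return (parsed.get("content") or "").strip()


def build_overview_doc(doc_id, full_text, product):
    """Merge the extracted text with the other fields of the overview document."""
    return {
        "id": doc_id,
        "contentType": "overview",
        "fullText": full_text,
        "product": product,
    }


def build_update_request(solr_url, collection, docs):
    """Prepare (but do not send) a JSON POST to Solr's /update handler."""
    url = f"{solr_url}/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To use it, call extract_text() on the PDF, pass the result to
build_overview_doc(), and send the request from build_update_request() with
urllib.request.urlopen(). Because the client assembles the document, the
related documents can go into the same update.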

In terms of structuring, you can use nested documents combined with the
[child] transformer:
https://lucene.apache.org/solr/guide/7_5/transforming-result-documents.html#child-childdoctransformerfactory
Note that you then have to index parent and children together as a block,
and update them as a block as well.
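
For illustration, a minimal sketch of the nested approach, assuming the
"_childDocuments_" JSON key used by Solr 7.x and the field names from the
question. The helper names and the collection name in the URL are made up
for this example. The block is indexed as one unit, and the [child]
transformer attaches the children to each matching overview document.

```python
from urllib.parse import urlencode


def nest_children(parent, children):
    """Return a copy of the parent with its children attached as one block."""
    block = dict(parent)
    block["_childDocuments_"] = [dict(child) for child in children]
    return block


def child_query_params(user_terms, rows=1, child_limit=10):
    """Search parents on fullText and return their children via [child]."""
    return {
        "q": user_terms,
        "df": "fullText",              # default field for the user terms
        "fq": "contentType:overview",  # only parents are returned as hits
        "rows": rows,
        "fl": f"*,[child parentFilter=contentType:overview limit={child_limit}]",
    }


def select_url(solr_url, collection, params):
    """Build the /select URL for the given request parameters."""
    return f"{solr_url}/{collection}/select?{urlencode(params)}"
```

The parentFilter has to match all parent documents and no children, which is
why it reuses the "contentType:overview" condition here.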

Or you could do [subquery]:
https://lucene.apache.org/solr/guide/7_5/transforming-result-documents.html#subquery
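
A sketch of the [subquery] variant, which keeps the documents flat and
relies on the "parentID" field from the question. The label "related" and
the helper name are arbitrary choices for this example. For every overview
hit, Solr runs a second query that looks up documents whose parentID equals
that hit's id.

```python
from urllib.parse import urlencode


def subquery_params(user_terms, related_rows=10):
    """Search overviews on fullText and attach related docs per hit."""
    return {
        "q": user_terms,
        "df": "fullText",
        "fq": "contentType:overview",
        "rows": 1,
        # "related" is the name under which the subquery results appear.
        "fl": "id,contentType,product,related:[subquery]",
        # $row.id is replaced with the id of each returned overview document.
        "related.q": "{!terms f=parentID v=$row.id}",
        "related.rows": related_rows,
    }


def select_url(solr_url, collection, params):
    """Build the /select URL for the given request parameters."""
    return f"{solr_url}/{collection}/select?{urlencode(params)}"
```

Unlike nested documents, this does not require indexing or updating parent
and children as a block.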

Or perhaps something with graphs, if you are running in SolrCloud
mode: https://lucene.apache.org/solr/guide/7_5/graph-traversal.html
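
A very rough sketch of the graph option, assuming the nodes() streaming
expression described on that page and a collection called "docs" (a made-up
name). The expression walks from a known overview id to the documents whose
parentID holds that id and gathers their ids. You would first run a normal
query to find the overview document.

```python
from urllib.parse import urlencode


def related_nodes_expr(collection, parent_id):
    """Build a nodes() expression that gathers ids of docs pointing at parent_id."""
    return f'nodes({collection}, walk="{parent_id}->parentID", gather="id")'


def stream_url(solr_url, collection, expr):
    """Build the /stream URL that carries the expression in the expr parameter."""
    return f"{solr_url}/{collection}/stream?{urlencode({'expr': expr})}"
```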

Regards,
   Alex.
On Thu, 8 Nov 2018 at 14:02, Stephon Harris
<[hidden email]> wrote:

>
> Following up on this to see if anyone has thoughts.