Metadata and FullText, indexed at different times - looking for best approach

Alexandre Rafalovitch
Hello,

I have a database of metadata and I can inject it into SOLR with DIH
just fine. But I also have documents to extract full text from, and I
want to add that text to the same records as additional fields. I
think DIH can run Tika at ingestion time, but I may not have the
full-text files at that point (they could arrive days later). I can
match the file to the metadata by a file name matching a field
name.

What is the best approach to do that staggered indexing with minimum
custom code? I guess my fallback position is a custom full-text
indexer agent that re-adds the metadata fields when the file is being
indexed. Is there anything better?

I am a newbie using v4.0alpha of SOLR (and loving it).

Thank you,
    Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)

Re: Metadata and FullText, indexed at different times - looking for best approach

Erick Erickson
You've got a couple of choices. There's a new patch in town
https://issues.apache.org/jira/browse/SOLR-139
that allows you to update individual fields in a doc if (and only if)
all the fields in the original document were stored (actually, all the
non-copy fields).

So if you're storing (stored="true") all your metadata information, you can
just update the document when the text becomes available, assuming you
know the uniqueKey when you update.
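
As a rough illustration of the update described above: in released Solr 4.x this capability is exposed as "atomic updates", where the new field value is wrapped in a modifier such as "set". The sketch below only builds the JSON body and sends nothing. The field names `id` and `fulltext`, the helper name, and the sample values are assumptions made up for illustration; they are not taken from this thread.

```python
import json


def build_atomic_update(unique_key, text, key_field="id", text_field="fulltext"):
    """Build a JSON body that sets one field on an already-indexed document.

    key_field and text_field are hypothetical schema names.
    """
    doc = {
        key_field: unique_key,      # identifies the existing document
        text_field: {"set": text},  # "set" replaces (or adds) only this field
    }
    return json.dumps([doc])        # the JSON update format accepts a list of docs


body = build_atomic_update("report-42.pdf", "extracted full text goes here")
print(body)
```

The body would then be POSTed to the core's update handler with a JSON content type, followed by a commit.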

Under the covers, this will find the old document, get all the fields, add the
new fields to it, and re-index the whole thing.
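
The read, merge, and re-index cycle just described can be modelled with a toy in-memory "index". This is not Solr's actual implementation, only a sketch of the idea; the dict-based index, the function name, and the sample fields are all hypothetical. It also shows why the original fields need to be stored: anything that cannot be read back in step 2 is lost when the document is rewritten.

```python
def apply_partial_update(index, unique_key, new_fields):
    """Mimic the read-merge-reindex cycle on an in-memory index (a dict)."""
    old_doc = index[unique_key]   # 1. find the old document by its uniqueKey
    merged = dict(old_doc)        # 2. copy the fields that can be read back
    merged.update(new_fields)     # 3. add (or overwrite) the new fields
    index[unique_key] = merged    # 4. re-index the whole document
    return merged


index = {"report-42.pdf": {"id": "report-42.pdf", "title": "Annual report"}}
apply_partial_update(index, "report-42.pdf", {"fulltext": "extracted text"})
print(index["report-42.pdf"])
```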

Otherwise, your fallback idea is a good one.

Best
Erick

On Sat, Jul 14, 2012 at 11:05 PM, Alexandre Rafalovitch
<[hidden email]> wrote:

> What is the best approach to do that staggered indexing with minimum
> custom code? I guess my fallback position is a custom full-text
> indexer agent that re-adds the metadata fields when the file is being
> indexed. Is there anything better?

Re: Metadata and FullText, indexed at different times - looking for best approach

Alexandre Rafalovitch
Thank you,

I am already on 4alpha. The patch feels a little too unstable for my
needs/familiarity with the code.

What about something around multiple cores? Could I have the full-text
fields stored in a separate core and somehow (again, with minimum
hand-coding) search against all those cores and get back a combined
list of document IDs? Or would that make comparative ranking/sorting
impossible?
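
To make the multi-core idea concrete, here is a minimal client-side sketch of the "combined list of document IDs" part. It assumes each core has already been queried separately and returned (doc_id, score) pairs; the function name and the sample hits are invented for illustration. The scores are deliberately ignored when merging, because relevance scores are generally computed from each index's own statistics, which is where the ranking concern comes from.

```python
def combine_hits(metadata_hits, fulltext_hits):
    """Merge two (doc_id, score) lists into one de-duplicated list of IDs.

    Order: metadata-core hits first, then IDs found only in the
    full-text core. Scores are not compared across cores.
    """
    seen = set()
    combined = []
    for doc_id, _score in list(metadata_hits) + list(fulltext_hits):
        if doc_id not in seen:
            seen.add(doc_id)
            combined.append(doc_id)
    return combined


metadata_hits = [("doc-1", 2.4), ("doc-3", 1.1)]
fulltext_hits = [("doc-3", 7.9), ("doc-2", 0.6)]
print(combine_hits(metadata_hits, fulltext_hits))  # ['doc-1', 'doc-3', 'doc-2']
```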

Regards,
   Alex.

On Sun, Jul 15, 2012 at 12:08 PM, Erick Erickson
<[hidden email]> wrote:

> You've got a couple of choices. There's a new patch in town
> https://issues.apache.org/jira/browse/SOLR-139
> that allows you to update individual fields in a doc if (and only if)
> all the fields in the original document were stored (actually, all the
> non-copy fields).

Re: Metadata and FullText, indexed at different times - looking for best approach

Erick Erickson
In that case, I think your best option is to re-index the entire document,
metadata and all, when you have the text available. Which actually
raises the question of whether you want to index the bare metadata at
all. Is it the use case that the user actually gets value when there's no
text? If not, forget DIH and just index the metadata as a result of the
text becoming available.
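
A minimal sketch of that full re-index step, assuming the metadata rows are available as dicts, the text has already been extracted elsewhere (for example by Tika), and the file is matched to its record by file name as described at the start of the thread. The field names `file_name` and `fulltext` and the function name are hypothetical.

```python
import os


def build_full_document(metadata_rows, file_path, extracted_text,
                        file_field="file_name", text_field="fulltext"):
    """Return a complete document (metadata plus text) for one arrived file.

    Returns None when no metadata row matches the file name.
    """
    name = os.path.basename(file_path)
    for row in metadata_rows:
        if row.get(file_field) == name:
            doc = dict(row)                  # re-add every metadata field
            doc[text_field] = extracted_text  # plus the newly available text
            return doc
    return None


rows = [{"id": "rec-1", "title": "Annual report", "file_name": "report-42.pdf"}]
print(build_full_document(rows, "/incoming/report-42.pdf", "extracted text"))
```

Sending the returned document as an ordinary add with the same uniqueKey replaces the earlier metadata-only version.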

Best
Erick

On Mon, Jul 16, 2012 at 1:43 PM, Alexandre Rafalovitch
<[hidden email]> wrote:

> Thank you,
>
> I am already on 4alpha. The patch feels a little too unstable for my
> needs/familiarity with the code.

Re: Metadata and FullText, indexed at different times - looking for best approach

Alexandre Rafalovitch
Thank you,

Re-indexing does look like a real option then. I am now looking at
storing the text/files in MongoDB or the like and indexing into SOLR
from that. Initially, I was going to skip the DB part for as long as
possible.

Regarding the use case, yes, it does make sense to have just metadata.
It is rich, curated metadata that works without files (several, each
in its own language). So, before files show up, the search is against
title/subject/etc. When the files show up, one by one, they get added
into the index for additional/enhanced results.
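
A small sketch of how that staged flow might look, reading "several, each in its own language" as several files per record. A plain dict stands in for the external store (MongoDB or similar), and the `text_<lang>` field naming, the function name, and the sample record are assumptions for illustration only. The document is rebuilt in full each time another file arrives.

```python
def build_document(metadata, texts_by_language):
    """Assemble one complete document from metadata plus any texts so far."""
    doc = dict(metadata)                       # metadata is always present
    for lang, text in sorted(texts_by_language.items()):
        doc["text_" + lang] = text             # one hypothetical field per language
    return doc


# Stand-in for the external store: metadata first, texts added later.
store = {"metadata": {"id": "rec-7", "title": "Charter"}, "texts": {}}
print(build_document(store["metadata"], store["texts"]))  # metadata only

store["texts"]["en"] = "English text"          # first file shows up
store["texts"]["fr"] = "Texte en francais"     # second file shows up later
print(build_document(store["metadata"], store["texts"]))
```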

Again, thank you for walking through this with me.

Regards,
   Alex.

On Tue, Jul 17, 2012 at 9:12 AM, Erick Erickson <[hidden email]> wrote:

> In that case, I think your best option is to re-index the entire document,
> metadata and all, when you have the text available. Which actually
> raises the question of whether you want to index the bare metadata at
> all. Is it the use case that the user actually gets value when there's no
> text? If not, forget DIH and just index the metadata as a result of the
> text becoming available.