Solr indexing for unstructured data

amritpattnaik
Hi,
I am new to Solr. I have a set of PDF documents with unstructured data that
have been parsed to text and stored in a separate directory.

After I create a collection and index with "bin/post -c collection
name document name", the document gets indexed and I am able to retrieve
it. The collection is in schemaless mode, and I have added fields to the
collection's managed-schema.

When I index with the bin/post command mentioned above, the query results do
not include the fields I added to the schema. So I tried indexing with a curl
command in which I explicitly set the field names and values in the document
sent for indexing. The required fields then show up in the query results, but
when I do a keyword-based search, the documents added through curl don't
show up.

I would appreciate any pointers or help, as I have been stuck on this issue
for a long time.

Regards,
Amrit

--
With Regards,

Amrit Pattnaik

Re: Solr indexing for unstructured data

Alexandre Rafalovitch
In the Admin UI, there is a Schema Browser screen:
https://lucene.apache.org/solr/guide/8_1/schema-browser-screen.html
It shows you all the fields you have, their configuration, and their
(tokenized) indexed content.

This seems to be a good midpoint between indexing and querying. So, I
would check whether the field you expect (and the fields you did not
expect) are there. If they are, focus on querying. If they are not,
focus on indexing.
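
If you would rather script that check than click through the Admin UI, here
is a minimal sketch using Solr's Schema API (GET .../schema/fields). The base
URL, collection name, and field names are hypothetical placeholders, and the
helper names are made up for illustration. Note that this only tells you
whether a field is defined in the schema; the Schema Browser additionally
shows the indexed terms.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def schema_fields_url(base_url, collection):
    """Build the Schema API URL that lists the explicitly defined fields."""
    return f"{base_url.rstrip('/')}/{quote(collection, safe='')}/schema/fields?wt=json"


def missing_fields(schema_response, expected):
    """Return the expected field names that the schema response does not list."""
    present = {field["name"] for field in schema_response.get("fields", [])}
    return [name for name in expected if name not in present]


def fetch_schema_fields(base_url, collection):
    """Fetch the field list from a running Solr instance (needs network access)."""
    with urlopen(schema_fields_url(base_url, collection)) as response:
        return json.load(response)


# Example call (hypothetical host, collection, and field names):
#   response = fetch_schema_fields("http://localhost:8983/solr", "mycollection")
#   print(missing_fields(response, ["title", "author"]))
```

An empty list from missing_fields means every expected field is defined, so
the next thing to look at would be the query side.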

This is generic advice, because the question is not really clear.
Specifically:
1) "PDF parsed as text" "and I index that file" - what does that file
look like (content type)?
2) "I index with bin/post" "I am able to retrieve results" vs "I use
bin/post above" "it does not return fields in query". I can't tell the
difference between those two sequences. If you are indexing the same
file with the same command, you should get the same results.
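
To make the two sequences comparable, it may help to write down the exact
queries being run. The sketch below only builds /select URLs from the
standard q, fl, df, rows and wt parameters; the host, collection, and field
names are hypothetical. One thing worth keeping in mind, though it may or may
not be the cause here: with the standard query parser, a keyword with no
field prefix is matched against the default search field (df), so a document
whose text sits only in some other field would not match.

```python
from urllib.parse import urlencode


def select_params(q, fl=None, df=None, rows=10):
    """Collect the query parameters for a /select request."""
    params = {"q": q, "rows": rows, "wt": "json"}
    if fl:
        params["fl"] = fl  # which stored fields to return
    if df:
        params["df"] = df  # default field for unfielded terms
    return params


def select_url(base_url, collection, q, fl=None, df=None, rows=10):
    """Build a full /select URL for the given collection and query."""
    query = urlencode(select_params(q, fl=fl, df=df, rows=rows))
    return f"{base_url.rstrip('/')}/{collection}/select?{query}"


base = "http://localhost:8983/solr"  # hypothetical host
coll = "mycollection"                # hypothetical collection

# 1. Match everything and list chosen fields: shows what is stored per document.
all_docs = select_url(base, coll, "*:*", fl="id,title")
# 2. Plain keyword search: relies on whatever default field is configured.
keyword = select_url(base, coll, "invoice")
# 3. Same keyword, but with the default field set explicitly.
keyword_in_field = select_url(base, coll, "invoice", df="title")
```

If query 3 finds the curl-indexed documents and query 2 does not, that would
point at the default search field rather than at indexing.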

Hope that helps.

Regards,
   Alex.

On Thu, 22 Aug 2019 at 09:44, amrit pattnaik <[hidden email]> wrote:

> [...]