Passing Metadata from an RTF-file via TIKA to SOLR ...



Jan.Christopher.Schluchtmann-EXT
Hi there!
I am quite new to Lucene/Solr/Tika, so I would appreciate your help
with the following matter.


I have an RTF document that I want to index in Solr using Tika.
RTF indexing works in general, but since I changed the Solr schema,
the indexer complains about missing mandatory fields such as "module-id".
I generate the RTF file myself and added the metadata fields to the
"userprops" section of the RTF file (see below), so Tika should be able
to read and provide them.

The problem is that I don't know how or where Tika provides this metadata,
so I don't know how to access it. As a result, I don't know how to map it
to the respective Solr fields, such as "module-id", that are mandatory in
my Solr schema.

Can someone give me a hint, please?
I am running out of ideas here ... :-/


RTF-FILE:

{\rtf1\fbidis\ansi\ansicpg1252\deff0\deflang1031{\fonttbl{\f0\fnil\fcharset0
Arial;}}
{\colortbl ;\red0\green0\blue0;}
        {\userprops
                {\propname module-id}\proptype30{\staticval 000ba8a6}
        }
}
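
One way to see which metadata Tika actually hands to Solr is the extract handler's extractOnly=true parameter, which returns the extracted content and metadata without indexing anything. The Python sketch below only builds such a request URL. Host, port, and core name are taken from the curl command further down in this thread and may differ in your setup; build_extract_only_url is a hypothetical helper name. Whether Tika's RTF parser exposes "userprops" entries at all is not confirmed here; the response to this request would show it.

```python
from urllib.parse import urlencode


def build_extract_only_url(base_url: str, core: str) -> str:
    """Build a Solr extract-handler URL that returns Tika's output without indexing."""
    params = {
        "extractOnly": "true",    # return extracted content and metadata, do not index
        "extractFormat": "text",  # plain text instead of the default XHTML
        "wt": "json",             # JSON response
    }
    return f"{base_url}/solr/{core}/update/extract?{urlencode(params)}"


# Assumed values, copied from the curl command in this thread.
url = build_extract_only_url("http://localhost:8983", "ContiReqManCore")
print(url)
```

Posting the RTF file to this URL (for example with curl -F, as in the request further down) should return the metadata names exactly as Tika reports them, which are the names a field mapping would have to refer to.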


With kind regards

Jan Schluchtmann
Systems Engineering Cluster Instruments
VW Group
Continental Automotive GmbH
Division Interior
ID S3 RM
VDO-Strasse 1, 64832 Babenhausen, Germany

Phone: +49 6073 12-4346
Fax: +49 6073 12-79-4346

Metadata passed with CURL (via literal) is not recognized by SOLR ...?

Jan.Christopher.Schluchtmann-EXT
Hi!
I am trying to index RTF files by uploading them to the Solr server with
curl.
I pass the required metadata via "literal.<key>=<value>" parameters.


The "id" and "module-id" fields are mandatory in my schema.
The "id" is recognized correctly, as one can see from "doc=48a0xxx" in the
Solr response, but "module-id" seems to be ignored.

Why is that?


Thanks in advance!!!



Here is the curl command I run in Windows 10 PowerShell:

SOLR-REQUEST:

curl.exe "http://localhost:8983/solr/ContiReqManCore/update/extract/?commit=true&literal.id=48a04d8e5da651c5-000ba8a6-1&literal.project-id=000d8181&literal.project-name=FPK_Medium_19S1&literal.project-path=%2FFPK_Medium_19S1&literal.module-id=000ba8a6&literal.module-name=PVVTS_Functional_FPK_Medium_19S1&literal.module-path=%2FFPK_Medium_19S1%2F02_Quality%2F10_Verification-Validation%2FPVVTS_Functional_FPK_Medium_19S1&literal.module-prefix=PVVTS_Funct_&literal.object-id=1" -F "object-ole=@D:\(...)\PVVTS_Funct_263.rtf"


SOLR-RESPONSE:

{
  "responseHeader":{
    "status":400,
    "QTime":7},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"[doc=48a04d8e5da651c5-000ba8a6-1] missing required field: module-id",
    "code":400}
}
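
The long query string in the request above is easier to check when it is assembled from a dictionary. A minimal Python sketch, assuming the same endpoint and a subset of the field values from that request; build_extract_url is a hypothetical helper, and urlencode takes care of percent-encoding the path values.

```python
from urllib.parse import urlencode


def build_extract_url(base_url: str, core: str, literals: dict) -> str:
    """Prefix each field name with 'literal.' and URL-encode all values."""
    params = {"commit": "true"}
    for field, value in literals.items():
        params[f"literal.{field}"] = value
    return f"{base_url}/solr/{core}/update/extract?{urlencode(params)}"


# A subset of the literals from the request above.
literals = {
    "id": "48a04d8e5da651c5-000ba8a6-1",
    "project-path": "/FPK_Medium_19S1",
    "module-id": "000ba8a6",
    "object-id": "1",
}
url = build_extract_url("http://localhost:8983", "ContiReqManCore", literals)
print(url)
```

Printing the URL before sending it makes it easy to confirm that every mandatory field is present under the exact name that was intended.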



Re: Metadata passed with CURL (via literal) is not recognized by SOLR ...?

Jan.Christopher.Schluchtmann-EXT
Ok, I found the solution myself.

The reason for this behaviour was the "lowernames=true" setting of the
Tika request handler, which transformed "module-id" into "module_id".
I added a matching copyField to my schema, and it seems to work now.
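
To make the renaming concrete: the Solr documentation describes lowernames as mapping field names to lowercase with underscores. The Python sketch below approximates that rule (letters and digits are lowercased, every other character becomes an underscore). It is an illustration, not Solr's actual code, and lower_name is a hypothetical name.

```python
def lower_name(name: str) -> str:
    """Approximate the lowernames rule: lowercase, non-alphanumerics become '_'."""
    return "".join(ch.lower() if ch.isalnum() else "_" for ch in name)


for field in ("module-id", "project-name", "Content-Type"):
    print(field, "->", lower_name(field))
```

Under this rule the value sent as literal.module-id arrives as the field module_id, so the required schema field module-id stays empty, which matches the error message.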


Maybe this information is useful for someone. Of course, it is
mentioned in the manual, but finding it is hard if you don't know
what you are looking for. ;)


Regards
Jan






From:    [hidden email]
To:      [hidden email]
Date:    05.12.2017 11:02
Subject: Metadata passed with CURL (via literal) is not recognized by SOLR ...?


