Extracting XMP metadata from PDF for indexing Nutch 1.15

Gilvary, Joseph
Happy New Year,

I've searched the archives and the web as best I can, tinkered with nutch-site.xml and schema.xml, but I can't get XMP metadata that's in the parse metadata into the Solr (7.6) index.

I want to index stuff like:

xmp:CreatorTool=PScript5.dll Version 5.2.2
xmpTPg:NPages=23

I get the pdf:docinfo:created, pdf:docinfo:modified, etc. fine, but swapping out ':' for '_' isn't working for the xmp stuff.

Is there more documentation on having Nutch get what Tika sees into what Solr will see?

Any help appreciated.

Thanks,

Joe

RE: Extracting XMP metadata from PDF for indexing Nutch 1.15

Markus Jelsma-2
Hello Joseph,

> Is there more documentation on having Nutch get what Tika sees into what Solr will see?

No, but I believe you would want to check out the parsechecker and indexchecker tools. These tools display what Tika sees and what will be sent to Solr.

Regards,
Markus
 
RE: Extracting XMP metadata from PDF for indexing Nutch 1.15

Gilvary, Joseph
Thanks, Markus,

Those are the tools I've been using to debug because it's quicker than reindexing even a test collection in Solr. So parsechecker shows that these fields are in the parse metadata, but I can't figure out how to get them into the index. The pdf:docinfo:fields will index as pdf_docinfo_fields, but the other namespaces using ':' aren't making it through and I'm at a loss.

Nutch schema.xml:

<field name="pdf_docinfo_created" type="pdates"/>
<field name="xmpTPg_NPages" type="int" indexed="true" stored="true"/>

nutch-site.xml:

  <property>
    <name>index.parse.md</name>
    <value>description,keywords,dcterms.created,dcterms.modified,dcterms.subject,pdf:docinfo:created,pdf:docinfo:modified,pdf:docinfo:title,xmp:CreatorTool,xmpTPg:NPages </value>
  </property>


Parsechecker sees the values for the xmp stuff:

Parse Metadata: date=2011-04-27T18:36:58Z pdf:PDFVersion=1.4 pdf:docinfo:title=Test File xmp:CreatorTool=PScript5.dll Version 5.2.2 access_permission:blah_blah_blah xmpTPg:NPages=23 access_permission:can_modify=true pdf:docinfo:producer=Acrobat Distiller 7.0.5 (Windows) pdf:docinfo:created=2011-04-27T18:33:06Z


Indexchecker doesn't:

fetching: http://127.0.01/test.pdf
robots.txt whitelist not configured.
parsing: http://127.0.01/test.pdf
pdf:docinfo:title :     Test File
tstamp :        Tue Dec 31 11:23:28 EST 2019
pdf:docinfo:modified :  2011-04-27T18:36:58Z
pdf:docinfo:created :   2011-04-27T18:33:06Z
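
The comparison between the two outputs above can be sketched as a short Python script. The key lists are copied by hand from the nutch-site.xml value, the (abbreviated) parsechecker line, and the indexchecker output in this message; the variable names are illustrative and nothing here calls Nutch itself.

```python
# Keys listed in index.parse.md, copied from the nutch-site.xml above
# (including the trailing space, which strip() removes).
configured = [key.strip() for key in (
    "description,keywords,dcterms.created,dcterms.modified,dcterms.subject,"
    "pdf:docinfo:created,pdf:docinfo:modified,pdf:docinfo:title,"
    "xmp:CreatorTool,xmpTPg:NPages "
).split(",")]

# Keys visible in the abbreviated parsechecker output.
parsed = {
    "date", "pdf:PDFVersion", "pdf:docinfo:title", "xmp:CreatorTool",
    "xmpTPg:NPages", "access_permission:can_modify",
    "pdf:docinfo:producer", "pdf:docinfo:created",
}

# Field names visible in the indexchecker output.
indexed = {
    "pdf:docinfo:title", "tstamp",
    "pdf:docinfo:modified", "pdf:docinfo:created",
}

# Configured keys that the parser produced but the indexer did not emit.
dropped = [key for key in configured if key in parsed and key not in indexed]
print(dropped)
```

With the values shown in this message, the only keys reported are xmp:CreatorTool and xmpTPg:NPages.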


The Dublin Core values don't use colon ':' but dot '.' and they show up fine. There are embedded spaces in some of the xmp values, but pdf:docinfo:title has them too, and it shows up in the indexchecker output. I'm wondering if there's anything special about the pdf:docinfo that isn't generalized or is somehow configurable for generalization to other namespaces.

 Thanks,

 Joe

Re: Extracting XMP metadata from PDF for indexing Nutch 1.15

Sebastian Nagel-2
Hi Joseph,

this could be related to
   https://issues.apache.org/jira/browse/NUTCH-2525
caused by not-all-lowercase meta keys.
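
To illustrate how a not-all-lowercase key can get lost, here is a simplified Python model. It assumes that the configured key is lowercased before an exact, case-sensitive lookup in the parse metadata. That is an assumption made for illustration, not a copy of the Nutch code or of the NUTCH-2525 patch, and both function names are hypothetical.

```python
# Parse metadata as reported by parsechecker; keys are case-sensitive.
parse_meta = {
    "pdf:docinfo:created": "2011-04-27T18:33:06Z",
    "xmp:CreatorTool": "PScript5.dll Version 5.2.2",
    "xmpTPg:NPages": "23",
}


def lookup_lowercased(key, meta):
    # ASSUMPTION: models an indexer that lowercases the configured key
    # and then performs an exact, case-sensitive lookup.
    return meta.get(key.lower())


def lookup_case_insensitive(key, meta):
    # Hypothetical alternative: compare keys without regard to case.
    wanted = key.lower()
    for name, value in meta.items():
        if name.lower() == wanted:
            return value
    return None


for key in ["pdf:docinfo:created", "xmp:CreatorTool", "xmpTPg:NPages"]:
    print(key, "|", lookup_lowercased(key, parse_meta),
          "|", lookup_case_insensitive(key, parse_meta))
```

Under this assumption the all-lowercase pdf:docinfo:created is found, while xmp:CreatorTool and xmpTPg:NPages are not, which matches the indexchecker output earlier in the thread.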

I'm happy to check whether the attached patch fixes your problem
when I'm back from holidays in a few days.

Best,
Sebastian

RE: Extracting XMP metadata from PDF for indexing Nutch 1.15

Gilvary, Joseph
Happy New Year, Sebastian,

Thank you. That looks promising. Hope you enjoy the holiday!

 Joe

Re: Extracting XMP metadata from PDF for indexing Nutch 1.15

Sebastian Nagel-2
Hi Joseph,

Sorry for the late reply. The patch for NUTCH-2525
fixes your problem. See also my comments in
   https://issues.apache.org/jira/browse/NUTCH-2525

Thanks,
Sebastian


RE: Extracting XMP metadata from PDF for indexing Nutch 1.15

Gilvary, Joseph
Thank you, Sebastian, for the updated patch and the pointer to the discussion.

 Joe
