NUTCH-1129, Any23, microdata parsing, indexing, and extraction?

David Ferrero
Pull request #205 was recently merged into master branch for Nutch 1.x in fulfillment of NUTCH-1129 "microdata for Nutch 1.x"

I am new to nutch and solr and have just started crawling and indexing a few select websites. Using the built-in html parsing/indexing, I am getting searchable fields like url, content, host, sometimes a title, and a few other indexing-related fields like digest, boost, segment, and tstamp. That said, I realized very quickly that I need better results. While exploring the source of the website, I noticed references to schema.org and got excited by what I saw. That’s how I stumbled upon NUTCH-1129.

I’ve built apache-nutch-1.15-SNAPSHOT which includes Any23 parser/indexer.

Q: Now what? How do I gain the Any23 microdata parsing / indexing capabilities introduced by NUTCH-1129?
Q: Do I replace parse-(html | tika)|index-(basic | anchor) in plugin.includes with something like parse-(html | tika | any23)|index-(basic | anchor | any23)?
Q: How do I expose the discovered microdata structure / items to an end consumer such as Solr? For example, what are the microdata items, and do I need to map them to Solr in solrindex-mapping.xml?

I’d also be interested to learn how to point at a specific URL and see how nutch sees the microdata (best case), then learn how to leverage this into nutch and finally into solr.

Thanks for any guidance.
-David
Re: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?

lewis john mcgibbney-2
Hi David,
Answers inline

On Thu, Feb 8, 2018 at 9:19 AM, <[hidden email]> wrote:

>
> Date: Thu, 8 Feb 2018 10:19:52 -0700
> I’ve built apache-nutch-1.15-SNAPSHOT which includes Any23 parser/indexer.
>

Excellent.


>
> Q: Now what?  How do I gain Any23 microdata parsing / indexing
> capabilities introduced by NUTCH-1129?
> Q: Do I replace parse-(html | tika)|index-(basic | anchor) in
> plugin.includes with something like parse-(html | tika |
> any23)|index-(basic | anchor | any23)
>

No, you just add 'any23' to the list of plugins within the plugin.includes
property of nutch-site.xml
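
To illustrate the answer above: as far as I understand it, plugin.includes is treated as a regular expression that must match a whole plugin id, so enabling the plugin means adding 'any23' as one more alternative. The sketch below only mimics that matching in Python. The value shown is an assumption built from the expression quoted in the question, not the shipped default, and `is_included` is a hypothetical helper.

```python
import re

# Hypothetical plugin.includes value: the parse/index plugins quoted in the
# question, with 'any23' appended as one more alternative.
PLUGIN_INCLUDES = r"parse-(html|tika)|index-(basic|anchor)|any23"


def is_included(plugin_id: str, includes: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin id matches the includes expression."""
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ("parse-html", "index-anchor", "any23", "parse-any23"):
        print(plugin_id, is_included(plugin_id))
```

Under this assumption, 'parse-any23' and 'index-any23' would not match anything, which is why the plugin is listed simply as 'any23'.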


> Q: How do I expose the discovered microdata structure / items to end-user
> such as Solr? For example, what are the microdata items and do I need to
> map them to Solr in solrindex-mapping.xml?
>

OK, so the current configuration for the Any23 plugin stores extracted
structured data markup in the Nutch Metadata object under the key
"Any23-Triples". You can inspect it with the ParserChecker tool
('parsechecker' in the 'nutch' script). Likewise, you can see a
representation of what would be indexed by using the IndexerChecker
tooling ('indexchecker' in the 'nutch' script).

Here is an example of how the data is now indexed (after crawling
https://smartive.ch/jobs):


          "structured_data": [
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"IE-edge,chrome=1\"@de",
              "key": "<http://vocab.sindice.net/any23#X-UA-Compatible>",
              "short_key": "X-UA-Compatible"
            },
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"Wir sind smartive \\u2014 eine dynamische, innovative Schweizer Webentwicklungsagentur. Die Realisierung zeitgem\\u00E4sser Webl\\u00F6sungen geh\\u00F6rt genauso zu unserer Passion, wie die konstruktive Zusammenarbeit mit unseren Kundinnen und Kunden.\"@de",
              "key": "<http://vocab.sindice.net/any23#description>",
              "short_key": "description"
            },
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"width=device-width, initial-scale=1, shrink-to-fit=no\"@de",
              "key": "<http://vocab.sindice.net/any23#viewport>",
              "short_key": "viewport"
            },
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"width=device-width,initial-scale=1\"@de",
              "key": "<http://vocab.sindice.net/any23#viewport>",
              "short_key": "viewport"
            },
            {
              "node": "<https://smartive.ch/jobs>",
              "value": "\"ie=edge\"@de",
              "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
              "short_key": "x-ua-compatible"
            }
          ],


Note from the above that the predicate field ('key', or its shortened form
'short_key') is very useful for quickly filtering for, say, Hotel Ratings or
something similar.
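
As a rough sketch of that kind of filtering, the snippet below picks out the values for one 'short_key' from a document shaped like the "structured_data" example above. It runs on a trimmed copy of that example rather than on a live index, and `values_for` is a hypothetical helper, not part of Nutch or Solr.

```python
import json

# Trimmed copy of the "structured_data" field shown above (two of the entries).
DOC = json.loads(r'''
{
  "structured_data": [
    {
      "node": "<https://smartive.ch/jobs>",
      "value": "\"width=device-width,initial-scale=1\"@de",
      "key": "<http://vocab.sindice.net/any23#viewport>",
      "short_key": "viewport"
    },
    {
      "node": "<https://smartive.ch/jobs>",
      "value": "\"ie=edge\"@de",
      "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
      "short_key": "x-ua-compatible"
    }
  ]
}
''')


def values_for(doc: dict, short_key: str) -> list:
    """Collect the values of all entries whose short_key matches."""
    return [
        entry["value"]
        for entry in doc.get("structured_data", [])
        if entry.get("short_key") == short_key
    ]


if __name__ == "__main__":
    print(values_for(DOC, "viewport"))
```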


>
> I’d also be interested to learn how to point at a specific URL and see how
> nutch sees the microdata (best case), then learn how to leverage this into
> nutch and finally into solr.
>
>
See the tooling for ParserChecker and IndexerChecker as explained above.
If you have any further questions, please let me know.
Lewis
Re: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?

David Ferrero
Thank you for this information. Since this is very much related to Any23 and microdata parsing, I’m going to ask what I believe is a related question but keep this same thread so it will be organized in one place:

I noticed that a lot of job boards such as dice.com, monster.com, etc. use http://schema.org/JobPosting information; however, many seem to use <script type="application/ld+json">…</script> rather than RDF.
In summer 2017, Google announced structured data guidance for jobs:
https://developers.google.com/search/docs/data-types/job-posting
and a testing tool to validate your HTML: https://search.google.com/structured-data/testing-tool
I verified a few sample listings from the above-mentioned job boards with Google’s testing tool and they validate OK.

So after looking at http://any23.apache.org/getting-started.html for the supported extractors, I saw that Any23 says it supports JSON-LD input, so I added this to nutch-site.xml to override the same property in nutch-default.xml:

<property>
    <name>any23.extractors</name>
    <value>html-microdata,html-embedded-jsonld,rdf-jsonld</value>
    <description>Comma-separated list of Any23 extractors (a list of extractors is available here: http://any23.apache.org/getting-started.html)</description>
</property>

I expected to see additional information from nutch parsechecker after adding the jsonld extractors; however, I see NO changes to the parsed Any23-Triples.

What might I be doing wrong?

Re: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?

David Ferrero
Just to be clear, I'm using any23 in plugin.includes and I am getting Any23-Triples metadata. However, I was hoping to see more Any23-Triples after adding the JSON-LD extractors to any23.extractors...

./bin/nutch parsechecker https://job-openings.monster.com/Senior-Software-Engineer-iOS-Boulder-CO-US-Housecanary/31/4edf245d-e563-43c2-9eca-e6c7c891d17a

fetching: https://job-openings.monster.com/Senior-Software-Engineer-iOS-Boulder-CO-US-Housecanary/31/4edf245d-e563-43c2-9eca-e6c7c891d17a
robots.txt whitelist not configured.
parsing: https://job-openings.monster.com/Senior-Software-Engineer-iOS-Boulder-CO-US-Housecanary/31/4edf245d-e563-43c2-9eca-e6c7c891d17a
contentType: text/html
signature: 84dcab335b0ae039c612dc8945bb9853
---------
Url
---------------

https://job-openings.monster.com/Senior-Software-Engineer-iOS-Boulder-CO-US-Housecanary/31/4edf245d-e563-43c2-9eca-e6c7c891d17a
---------
ParseData
---------

Version: 5
Status: success(1,0)
Title: Senior Software Engineer - iOS job at Housecanary | Monster.com
Outlinks: 1
  outlink: toUrl: https://job-openings.monster.com/Senior-Software-Engineer-iOS-Boulder-CO-US-Housecanary/31/4edf245d-e563-43c2-9eca-e6c7c891d17a anchor:
Content Metadata: Server=Microsoft-IIS/8.5 Access-Control-Allow-Origin=* Access-Control-Allow-Methods=PUT, GET, POST Connection=Close Pragma=no-cache Access-Control-Allow-Headers=Origin, X-Requested-With, Content-Type, Accept Date=Fri, 09 Feb 2018 07:42:02 GMT nutch.crawl.score=0.0 X-AspNetMvc-Version=5.2 nutch.fetch.time=1518162123187 X-Frame-Options=SAMEORIGIN X-UA-Compatible=IE=edge Cache-Control=no-cache, no-store Content-Encoding=gzip X-AspNet-Version=4.0.30319 Set-Cookie=atmResolver=|Seeker|jobviewcloud|58|164|; path=/; HttpOnly Expires=-1 Content-Length=37894 Content-Type=text/html; charset=utf-8 X-Powered-By=ASP.NET
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8 Any23-Triples= viewport=user-scalable=no, width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1 description==Complete the job application for Senior Software Engineer - iOS in Boulder, CO online today or find more job listings available at Housecanary at Monster. robots=all format-detection=telephone=no

Here is the application/ld+json in that job page that I was hoping would get extracted with Any23...

<script type="application/ld+json">
        {"title":"Senior Software Engineer - iOS","datePosted":"2017-07-27","description":"<div><p>At HouseCanary, we’re using data and analytics to predict the future of US residential real estate. Our goal is to help people make better decisions by offering innovative and unparalleled insights. HouseCanary’s platform accurately forecasts values 36 months into the future for three million residential blocks and more than 100 million properties.</p><p>We're seeking a passionate Senior iOS Engineer to help build our disruptive mobile products.</p><p><strong>What you'll do:</strong></p><ul><li>Integrate maps and location tracking (driving directions, optimized routes, etc.)<li>Refine our Sketch feature used to design floor plans<li>Work with heavy core data<li>Camera integration<li>And more!</ul><p><strong>What you have:</strong></p><ul><li>Expertise in Swift and Objective-C<li>Experience with Cocoa Touch, UIKit, Core Animation, Core Graphics, Core Location, MapKit, etc.<li>Autolayout and Storyboards exposure<li>Core Data knowledge<li>Familiarity with multi-threading and GCD<li>Experience interfacing with RESTful JSON APIs</ul><p><strong>Bonus points for knowledge of:</strong></p><ul><li>Real estate markets<li>Predictive modeling</ul><p>HouseCanary is the authoritative source for accurate, uniform information, analyzed and visualized real-time to make better, faster decisions.</p><p>HouseCanary - see into the future of real estate.</p></div>","educationRequirements":"Not specified","jobLocation":{"address":{"addressLocality":"Boulder","addressRegion":"CO","addressCountry":"US","@type":"PostalAddress"},"@type":"Place"},"hiringOrganization":{"name":"Housecanary","@type":"Organization"},"experienceRequirements":"Not specified","identifier":{"name":"Housecanary","value":"202874","@type":"PropertyValue"},"@context":"http://schema.org","@type":"JobPosting"}
    </script>
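
For comparison, here is a minimal standalone sketch of pulling such a block out of a page with nothing but the Python standard library. It is not how Any23 or Nutch does it; it only shows the data one would hope to see extracted. The sample HTML is a shortened, hypothetical version of the script block above, and `JsonLdCollector` and `extract_jsonld` are made-up names.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


def extract_jsonld(html: str) -> list:
    """Return every JSON-LD object embedded in the given HTML string."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    return collector.items


# Shortened, hypothetical version of the job posting shown above.
SAMPLE = """<html><head>
<script type="application/ld+json">
{"title": "Senior Software Engineer - iOS",
 "datePosted": "2017-07-27",
 "hiringOrganization": {"name": "Housecanary", "@type": "Organization"},
 "@context": "http://schema.org", "@type": "JobPosting"}
</script>
</head><body></body></html>"""

if __name__ == "__main__":
    for item in extract_jsonld(SAMPLE):
        print(item["@type"], "-", item["title"])
```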


Re: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?

lewis john mcgibbney-2
Hi David,
We are in the process of releasing Any23 2.2, which will include the fix.
We can then come back to Nutch and make the upgrade, and you should be all set.
Hopefully this will be done within around 72 hours. In the meantime, you can clone, build and deploy Any23 master. This will do the trick.
Lewis

Re: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?

David Ferrero
Great news about the forthcoming Any23 2.2 release. I look forward to it and the subsequent bump in Nutch.

In the meantime, I was able to build Any23 from master, copy the any23 jars into Nutch (master), and then reference them in the plugin:
    <library name="apache-any23-api-2.3-SNAPSHOT.jar"/>
    <library name="apache-any23-core-2.3-SNAPSHOT.jar"/>
    <library name="apache-any23-csvutils-2.3-SNAPSHOT.jar"/>
    <library name="apache-any23-encoding-2.3-SNAPSHOT.jar"/>
    <library name="apache-any23-mime-2.3-SNAPSHOT.jar"/>

Unfortunately, when I reran nutch parsechecker it no longer parsed at all. A quick look at logs/hadoop.log reveals that the updated any23 depends on new classes in other jar files:
Caused by: java.lang.NoClassDefFoundError: org/apache/commons/rdf/api/IRI
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.semanticweb.owlapi.rio.OWLAPIRDFFormat
java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.jsoup.select.NodeTraversor.traverse(Lorg/jsoup/select/NodeVisitor;Lorg/jsoup/nodes/Node;)V

I guess I would need to rebuild nutch from master (rather than just copy a few jar files) and ensure that any23’s jar dependencies are also referenced.
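
One quick way to check which jars, if any, provide the classes named in those errors is to scan a directory of jars for the corresponding entries. The sketch below assumes nothing about the Nutch layout: the directory is whatever you pass in, and `jars_providing` is a hypothetical helper. Note that "Could not initialize class" can also mean the class is present but its own dependencies are not, and the jsoup NoSuchMethodError points to a version mismatch, which this scan would not detect.

```python
import sys
import zipfile
from pathlib import Path

# Classes named in the NoClassDefFoundError lines above, as jar entry paths.
MISSING = [
    "org/apache/commons/rdf/api/IRI.class",
    "org/semanticweb/owlapi/rio/OWLAPIRDFFormat.class",
]


def jars_providing(lib_dir, entries):
    """Map each class entry to the names of the jars in lib_dir that contain it."""
    found = {entry: [] for entry in entries}
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as archive:
            names = set(archive.namelist())
        for entry in entries:
            if entry in names:
                found[entry].append(jar.name)
    return found


if __name__ == "__main__":
    # The directory to scan is given on the command line (defaults to ".").
    lib_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for entry, jars in jars_providing(lib_dir, MISSING).items():
        print(entry, "->", jars or "not found")
```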

>>> property of nutch-site.xml
>>>
>>>
>>>> Q: How do I expose the discovered microdata structure / items to end-user
>>>> such as Solr? For example, what are the microdata items and do I need to
>>>> map them to Solr in solrindex-mapping.xml?
>>>>
>>>
>>> OK, so the current configuration for the Any23 plugin is to store extracted
>>> structured data markup in the Nutch Metadata object under the key
>>> "Any23-Triples". You can locate it using something like the ParserChecker
>>> tool provided via the 'nutch' script. Likewise, you can also locate it, as a
>>> representation of what would be indexed, by using the IndexerChecker
>>> tooling also provided within the 'nutch' script.
>>>
>>> For example, the data is now indexed as follows (example
>>> after crawling https://smartive.ch/jobs):
>>>
>>>
>>>         "structured_data": [
>>>           {
>>>             "node": "<https://smartive.ch/jobs>",
>>>             "value": "\"IE-edge,chrome=1\"@de",
>>>             "key": "<http://vocab.sindice.net/any23#X-UA-Compatible>",
>>>             "short_key": "X-UA-Compatible"
>>>           },
>>>           {
>>>             "node": "<https://smartive.ch/jobs>",
>>>             "value": "\"Wir sind smartive \\u2014 eine dynamische, innovative Schweizer Webentwicklungsagentur. Die Realisierung zeitgem\\u00E4sser Webl\\u00F6sungen geh\\u00F6rt genauso zu unserer Passion, wie die konstruktive Zusammenarbeit mit unseren Kundinnen und Kunden.\"@de",
>>>             "key": "<http://vocab.sindice.net/any23#description>",
>>>             "short_key": "description"
>>>           },
>>>           {
>>>             "node": "<https://smartive.ch/jobs>",
>>>             "value": "\"width=device-width, initial-scale=1, shrink-to-fit=no\"@de",
>>>             "key": "<http://vocab.sindice.net/any23#viewport>",
>>>             "short_key": "viewport"
>>>           },
>>>           {
>>>             "node": "<https://smartive.ch/jobs>",
>>>             "value": "\"width=device-width,initial-scale=1\"@de",
>>>             "key": "<http://vocab.sindice.net/any23#viewport>",
>>>             "short_key": "viewport"
>>>           },
>>>           {
>>>             "node": "<https://smartive.ch/jobs>",
>>>             "value": "\"ie=edge\"@de",
>>>             "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
>>>             "short_key": "x-ua-compatible"
>>>           }
>>>         ],
>>>
>>>
>>> Note from the above that the 'key' field (the predicate of each triple) is very useful for quickly
>>> filtering through, for example, Hotel Ratings, or something similar.
>>>
>>>
>>>>
>>>> I’d also be interested to learn how to point at a specific URL and see how
>>>> nutch sees the microdata (best case), then learn how to leverage this into
>>>> nutch and finally into solr.
>>>>
>>>>
>>> See the tooling for ParserChecker and IndexerChecker as explained above.
>>> Any further questions, please let me know.
>>> Lewis
>>
>>
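
The `structured_data` entries quoted above each carry a `node`, `value`, `key` and `short_key`, so filtering on the key is straightforward once the indexed document has been decoded from JSON. What follows is a minimal sketch under that assumption. The function name `group_by_short_key` is hypothetical, and the two sample entries are copied from the quoted example.

```python
from collections import defaultdict


def group_by_short_key(structured_data):
    """Map each short_key to the list of values recorded under it."""
    grouped = defaultdict(list)
    for entry in structured_data:
        grouped[entry["short_key"]].append(entry["value"])
    return dict(grouped)


# Two entries taken from the quoted example, as they look after JSON decoding.
sample = [
    {
        "node": "<https://smartive.ch/jobs>",
        "value": "\"width=device-width,initial-scale=1\"@de",
        "key": "<http://vocab.sindice.net/any23#viewport>",
        "short_key": "viewport",
    },
    {
        "node": "<https://smartive.ch/jobs>",
        "value": "\"ie=edge\"@de",
        "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
        "short_key": "x-ua-compatible",
    },
]

# Pick out every value stored under one key; an unknown key yields an empty list.
viewport_values = group_by_short_key(sample).get("viewport", [])
```

The same grouping would apply to schema.org properties once they appear in the extracted triples, since each entry has the same four fields.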


Re: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?

lewis john mcgibbney-2
In reply to this post by David Ferrero
Hi David,
The java.lang.NoClassDefFoundError issues could be resolved simply by
including the correct JAR artifacts.
We will have the issue resolved correctly very soon and I will let you
know when Any23 2.2 is released.
Lewis



--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc