Extract all image and video links from a web page

Extract all image and video links from a web page

prateek sachdeva
Hi Folks,

A very happy new year to all of you.

I am currently using Apache Nutch 1.16 and successfully extracting the HTML
content for the given seed URLs. One of my requirements is to extract all
the image and video links from the HTML into a separate object. Since I have
the HTML content, I could use a library like jsoup to parse it and
extract the img tags.
I was wondering if there is a way to do this in Nutch itself?
I assume I will have to implement HtmlParseFilter in a plugin and add my
extraction logic there. Is my understanding correct? Any sample code
reference would be helpful as well.

Thanks
Prateek

Re: Extract all image and video links from a web page

lewis john mcgibbney-2
Hi prateek,
Please see my comment inline below

On Thu, Jan 14, 2021 at 6:39 AM <[hidden email]> wrote:

>
> One of the requirements I have is to extract all
> the image and video links from the html in a separate object. Since I have
> the html content, I can use a library like jsoup to parse the content and
> extract img tags.
> I was wondering if there is a way in nutch to do this?
>

The problem here is your requirement of "... in a separate object". Will
this separate object be a new record?


> I am assuming I will have to override HtmlParseFilter class and then add my
> extraction logic there. Is my understanding correct? Any sample code
> reference will be helpful as well.
>
>
I think you can simply add parse-html OR parse-tika AND parse-xsl to the
'plugin.includes' configuration property and then use the ordered
HTMLParseFilter configuration option 'htmlparsefilter.order' as follows
https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1599

You can take a look at the parse-xsl plugin
https://github.com/apache/nutch/pull/439/files#diff-bb284524d36ab1d537581c95eb200b98a9e28bb8a8b48329914d2e09f6413d36

N.B. This patch is not yet merged into the Nutch master branch, so it is not
available in an official Nutch release. You would need to upgrade to the Nutch
1.18-SNAPSHOT master branch and then apply the patch. Any feedback would
be greatly appreciated.

--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc

Re: Extract all image and video links from a web page

prateek sachdeva
Hi Lewis,

Thanks for your reply. Unfortunately, I don't have the liberty to upgrade my
current version to an unreleased one, so the suggestion to use
parse-xsl won't help at this time. Is the only other option to
implement HtmlParseFilter in a new plugin?

Regarding the separate object, what I meant is that if I store the image links
as Outlinks, those links will also be stored in the DB (because all outlinks
are stored for the next crawl when depth > 1). I don't want to store them in
the crawldb; I just want to output them in some other object within the
record. I hope this makes sense.

Regards
Prateek


Re: Extract all image and video links from a web page

lewis john mcgibbney-2
Hi Prateek,

On 2021/01/19 15:58:29, prateek <[hidden email]> wrote:
> Is the only other option to
> override HtmlParseFilter and add a new plugin?

Yes I think it is.

>
> Also regarding separate objects, what i meant is if i store the image links
> in Outlink, then those links will also be stored in DB (because all outlink
> are stored for next crawl of depth > 1). I don't want to store those in
> crawldb and just output in some other object within the record. I hope this
> makes sense

I understand. Seeing as you cannot upgrade, then yes, I think you need to implement a new plugin to capture the image and video links as a new field in the NutchDocument. You should also look into the 'parser.html.outlinks.ignore_tags' configuration setting, which lets you specify the tags that are ignored when outlinks are extracted.
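
As a rough illustration of the extraction logic such a plugin would need, here is a minimal sketch that walks a W3C DOM tree and collects the src attributes of img, video and video/source elements into separate lists. It uses only the JDK so that it runs on its own. The class name MediaLinkCollector, the map keys "images" and "videos", and the parseXhtml helper are hypothetical. In a real HtmlParseFilter the DOM would come from the DocumentFragment that Nutch passes to the filter, and the collected links could be added to the parse metadata instead of the outlinks, so they never reach the crawldb.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper: collects image and video URLs from a DOM tree. */
final class MediaLinkCollector {

    /** Returns the links found under root, keyed by "images" and "videos". */
    static Map<String, List<String>> collect(Node root) {
        Map<String, List<String>> found = new LinkedHashMap<>();
        found.put("images", new ArrayList<>());
        found.put("videos", new ArrayList<>());
        walk(root, found);
        return found;
    }

    /** Depth-first walk over the node and all of its descendants. */
    private static void walk(Node node, Map<String, List<String>> found) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // HTML parsers may report tag names in upper case, so normalize first.
            String tag = el.getNodeName().toLowerCase(Locale.ROOT);
            // getAttribute returns an empty string when the attribute is absent.
            String src = el.getAttribute("src").trim();
            if (!src.isEmpty()) {
                if (tag.equals("img")) {
                    found.get("images").add(src);
                } else if (tag.equals("video")
                        || (tag.equals("source") && hasVideoParent(el))) {
                    found.get("videos").add(src);
                }
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, found);
        }
    }

    /** True for <source> elements nested directly inside a <video>. */
    private static boolean hasVideoParent(Element el) {
        Node parent = el.getParentNode();
        return parent != null && "video".equalsIgnoreCase(parent.getNodeName());
    }

    /** Demo-only helper: parses well-formed XHTML with the JDK XML parser. */
    static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<img src=\"/logo.png\"/>"
                + "<video src=\"/intro.mp4\"></video>"
                + "<video><source src=\"/clip.webm\"/></video>"
                + "<a href=\"/next\">next</a>"
                + "</body></html>";
        System.out.println(collect(parseXhtml(page)));
    }
}
```

The sketch keeps the src values exactly as written in the page. A real plugin would probably also resolve relative links against the page URL and may want to handle further attributes such as srcset or poster.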

lewismc

Re: Extract all image and video links from a web page

prateek sachdeva
Hi Lewis,

Thanks for your suggestion.

I looked at the class that extracts outlinks and saw that "img" is already
handled there -
https://github.com/apache/nutch/blob/680df6ba1dc68ad5ede5fca743304593d4d5b0a3/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L90.
So I am confused as to why I don't see any images in the outlinks.
I have double-checked that the property parser.html.outlinks.ignore_tags is
not set either, so images should already be part of the outlinks. But
when I run "bin/nutch readseg" to inspect the segment data, I don't see any
images being captured. Any idea what I am missing?

If there is a way to get all images into the outlinks, then maybe I don't even
need a plugin for that.

Regards
Prateek


Re: Extract all image and video links from a web page

Sebastian Nagel-2
Hi Prateek,

Are there any URL filters which filter away image links?

You can verify this using the URL filter checker:

  echo "https://example.com/image.jpg" \
    | bin/nutch filterchecker -stdin

The default rules in conf/regex-urlfilter.txt exclude common
image suffixes. Note that further URL filter plugins may be activated
via the property plugin.includes.
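
To make the effect of such a suffix rule concrete, here is a minimal, simplified model of a regex URL filter in plain Java. It assumes rules of the form "+regex" (accept) or "-regex" (reject), evaluated top to bottom with the first matching rule deciding and unmatched URLs rejected. The class name RegexUrlFilterSketch and the three rules are illustrative only; they are not copied from any Nutch release, and the suffix list shipped in conf/regex-urlfilter.txt is longer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical, simplified model of a first-match-wins +/- regex URL filter. */
final class RegexUrlFilterSketch {

    /** One rule: a sign (accept or reject) plus the pattern it applies to. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses rule lines; blank lines and lines starting with '#' are skipped. */
    RegexUrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** Returns the URL if accepted, or null if it is rejected. */
    String filter(String url) {
        for (Rule rule : rules) {
            // find() matches anywhere in the URL, so suffix rules need a trailing $.
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null; // no rule matched
    }

    public static void main(String[] args) {
        RegexUrlFilterSketch filter = new RegexUrlFilterSketch(Arrays.asList(
                "# illustrative rules only",
                "-(?i)\\.(gif|jpg|jpeg|png|bmp|ico|mov|mpg)$",
                "+."));
        for (String url : Arrays.asList(
                "https://example.com/image.jpg",
                "https://example.com/page.html")) {
            System.out.println(url + " -> "
                    + (filter.filter(url) == null ? "rejected" : "accepted"));
        }
    }
}
```

With these rules the image URL is rejected before the catch-all "+." rule is reached, which would explain image links being extracted by the parser yet missing from the stored outlinks.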

Best,
Sebastian
