Include parent URL in pdf data - nutch

Include parent URL in pdf data - nutch

UMA MAHESWAR
I am using Nutch 1.x for website crawling and indexing into Solr (5.5.0).
I am trying to include the parent URL along with the PDF data.
Can someone please suggest a way to do it?

Thanks in advance for your comments and suggestions.



--
Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html

Re: Include parent URL in pdf data - nutch

Sebastian Nagel-2
Hi,

could you explain in detail what is meant by "parent URL"?
- the page the PDF document is linked from
- a redirect pointing to the PDF doc
- the "directory" of the PDF URL (clip URL after last "/")
- ...

Nutch indexes all successfully fetched pages but not redirects,
404s, etc. Of course, pages not crawled cannot be indexed.

Best,
Sebastian

On 09/27/2018 11:58 AM, UMA MAHESWAR wrote:

> I am using Nutch 1.x for website crawling and indexing into Solr (5.5.0).
> I am trying to include the parent URL along with the PDF data.
> Can someone please suggest a way to do it?
>
> Thanks in advance for your comments and suggestions.


Re: Include parent URL in pdf data - nutch

UMA MAHESWAR
Hi Sir,

By parent URL, I mean the page the PDF document is linked from.

In other words, the name of the website on which the PDF is found.

Example: I am crawling multiple PDFs from multiple websites. I just want
to index the respective website name along with each PDF crawled from
that website.
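
If the PDF is served from the same site as the page linking to it, the "website name" could be read off the PDF URL itself. The sketch below is only an illustration in Python, not something Nutch does for you: `site_of` and `parent_directory` are hypothetical helper names, and the example URL is made up. The second helper corresponds to the "clip URL after last /" reading of "parent URL".

```python
from urllib.parse import urlsplit


def site_of(url):
    """Return the lowercased host of a URL, or "" if it has none."""
    return urlsplit(url).hostname or ""


def parent_directory(url):
    """Clip the URL after the last "/" of its path (query and fragment are dropped)."""
    parts = urlsplit(url)
    path = parts.path[: parts.path.rfind("/") + 1]
    return f"{parts.scheme}://{parts.netloc}{path}"


if __name__ == "__main__":
    pdf = "http://www.example.com/docs/reports/annual.pdf"  # made-up URL
    print(site_of(pdf))
    print(parent_directory(pdf))
```

This does not help when the PDF is hosted on a different site than the page that links to it; in that case the linking page has to be found through link data.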

Thanks,
Uma




RE: [Non-DoD Source] Re: Include parent URL in pdf data - nutch

Musshorn, Kris T CTR USARMY CECOM (US)
In reply to this post by Sebastian Nagel-2
Please remove me from this list

-----Original Message-----
From: Sebastian Nagel [mailto:[hidden email]]
Sent: Friday, September 28, 2018 2:25 AM
To: [hidden email]
Subject: [Non-DoD Source] Re: Include parent URL in pdf data - nutch


Re: Include parent URL in pdf data - nutch

Jorge Betancourt
In reply to this post by UMA MAHESWAR
If I understand correctly, what you want is to index/store the URL where
the PDF link was found, right? We don't track the name of the website (by
default). But you could do this (sort of) using the index-links plugin (
https://github.com/apache/nutch/tree/master/src/plugin/index-links).

This will allow you to index all the outlinks of a given URL. So if A is
the parent URL of B (the PDF file), then you should be able to find the B URL
in the outlinks of A. This basically inverts the problem: instead of
looking for the parent of B, you would be looking for any URL that has B
as an outlink. In theory you could find all the URLs that point to a
specific resource (PDF file).
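
A minimal sketch of that inverted lookup, in Python and under stated assumptions: the outlinks are assumed to land in a multi-valued Solr field named `outlinks` (check your schema for the actual name and field type), the Solr select URL and the example URLs are made up, and both function names are hypothetical. The first function only builds the query URL; the second inverts a list of already fetched documents in memory.

```python
from collections import defaultdict
from urllib.parse import urlencode


def solr_parent_query(select_url, pdf_url, field="outlinks"):
    """Build a Solr select URL asking for pages that list pdf_url as an outlink.

    The field name is an assumption; adjust it to match your index.
    """
    # Escape backslashes and double quotes so the URL is a valid quoted term.
    escaped = pdf_url.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": f'{field}:"{escaped}"', "fl": "url", "wt": "json"}
    return f"{select_url}?{urlencode(params)}"


def invert_outlinks(docs):
    """Map each outlink target to the URLs of the pages that link to it."""
    parents = defaultdict(list)
    for doc in docs:
        for target in doc.get("outlinks", []):
            parents[target].append(doc["url"])
    return dict(parents)


if __name__ == "__main__":
    # Made-up documents shaped like a Solr response with url and outlinks.
    docs = [
        {"url": "http://a.example/index.html",
         "outlinks": ["http://a.example/f.pdf", "http://a.example/about.html"]},
        {"url": "http://b.example/",
         "outlinks": ["http://a.example/f.pdf"]},
    ]
    print(invert_outlinks(docs)["http://a.example/f.pdf"])
    print(solr_parent_query("http://localhost:8983/solr/nutch/select",
                            "http://a.example/f.pdf"))
```

Note that a PDF can have several parents this way, so the result is a list of pages rather than a single "parent URL".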

Hope that helps,

Best Regards,
Jorge

On Fri, Sep 28, 2018 at 11:46 AM UMA MAHESWAR <[hidden email]>
wrote:

> Hi Sir,
>
> By parent URL, I mean the page the PDF document is linked from.
>
> In other words, the name of the website on which the PDF is found.
>
> Example: I am crawling multiple PDFs from multiple websites. I just want
> to index the respective website name along with each PDF crawled from
> that website.
>
> Thanks,
> Uma

Re: [Non-DoD Source] Re: Include parent URL in pdf data - nutch

Jorge Betancourt
In reply to this post by Musshorn, Kris T CTR USARMY CECOM (US)
Hi Musshorn,

Take a look at http://nutch.apache.org/mailing_lists.html for how to
unsubscribe from the mailing list: send an email to
[hidden email].

Best Regards,
Jorge

On Fri, Sep 28, 2018 at 1:24 PM Musshorn, Kris T CTR USARMY CECOM (US) <
[hidden email]> wrote:

> Please remove me from this list