Parsed segment has outlinks filtered

Sachin Mittal
Hi,
I am a bit confused about the outlinks generated from a parsed URL.
If I use the utility:

bin/nutch parsechecker url

The output lists all the outlinks.

However, if I check the dump of the parsed segment generated by the nutch crawl
script, using the command:

bin/nutch readseg -dump /segments/<>/ /outputdir -nocontent -nofetch -nogenerate -noparse -noparsetext

and review the same entry's ParseData, I see it has a lot fewer outlinks.
Basically, all the outlinks which did not match the regexes defined in
regex-urlfilter.txt have been filtered out.

So I want to know if there is a way to avoid this and make sure the
outlinks stored in the nutch segments contain all the URLs and not just
the filtered ones.

Even if you can only point me to the code where this URL filtering happens for
outlinks, I can figure out a way to circumvent it.

Thanks
Sachin

RE: Parsed segment has outlinks filtered

Yossi Tamari
Hi Sachin,

I'm not sure what you are trying to achieve: If you don't want to filter the outlinks, why do you enable urlfilter-regex?
Anyway, if you set the property parse.filter.urls to false, the Parser will not filter outlinks at all.

        Yossi.
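
A minimal sketch of this suggestion for a single parse run. It assumes the parse tool accepts Hadoop's generic -D option; the segment path and the NUTCH dry-run wrapper are hypothetical placeholders, not names from the thread. The same property could instead be set in conf/nutch-site.xml.

```shell
# Hypothetical segment path; substitute a real segment directory.
SEGMENT="crawl/segments/SEGMENT_NAME"

# Dry run by default: prints the command instead of executing it.
# Set NUTCH="bin/nutch" to run it for real.
NUTCH="echo bin/nutch"

# Assumption: the parse tool honours Hadoop's generic -D option.
# parse.filter.urls=false turns off URL filtering of outlinks at parse time.
parse_cmd="parse -Dparse.filter.urls=false $SEGMENT"

$NUTCH $parse_cmd
```

This only changes what the parse step stores; it does not by itself change which URLs later steps select.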

-----Original Message-----
From: Sachin Mittal <[hidden email]>
Sent: Thursday, 17 October 2019 19:15
To: [hidden email]
Subject: Parsed segment has outlinks filtered


Re: Parsed segment has outlinks filtered

Sachin Mittal
Hi,

Thanks, I figured this out. Let's hope it works!

urlfilter-regex is required to filter the URLs for the next crawl; however,
I still want to index all the outlinks of my current URL.
The reason is that I may not want Nutch to crawl these outlinks in the next
round, but I may still want some other crawler to scrape these URLs.

Sachin


On Thu, Oct 17, 2019 at 10:01 PM <[hidden email]> wrote:


Re: Parsed segment has outlinks filtered

Sebastian Nagel-2
Hi Sachin,

Practically every Nutch tool (inject, generate, fetch, parse, update, index)
can filter (and normalize) URLs. Because filtering and normalizing are expensive,
only the steps which add new URLs (inject and parse) do this by default (see
bin/crawl).

For your use case you might instead filter during the generation step:
* remove the -noFilter option of the generate command
* add -noFilter to the parse step, or set parse.filter.urls to false
  as Yossi mentioned

For historical reasons (difficult to change when trying to ensure backwards
compatibility), some commands have a -filter argument while others have -noFilter.
In addition, there are often configuration properties to achieve the same.
But command-line args always take precedence.

Best,
Sebastian
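
The two changes listed above can be sketched as direct commands. This is a dry-run sketch, not the crawl script itself: the crawl directory, segment name, -topN value and the NUTCH wrapper are hypothetical, and the argument order is assumed from the tools' usage messages.

```shell
# Hypothetical locations; substitute your own crawl directory and segment.
CRAWL_PATH="crawl"
SEGMENT="$CRAWL_PATH/segments/SEGMENT_NAME"

# Dry run by default: prints each command instead of executing it.
# Set NUTCH="bin/nutch" to run them for real.
NUTCH="echo bin/nutch"

# Generate: -noFilter is left out, so URL filters apply when the
# fetch list is built.
generate_cmd="generate $CRAWL_PATH/crawldb $CRAWL_PATH/segments -topN 1000"

# Parse: -noFilter keeps every outlink in the segment's ParseData.
parse_cmd="parse $SEGMENT -noFilter"

$NUTCH $generate_cmd
$NUTCH $parse_cmd
```

The intent is that unfiltered outlinks are recorded at parse time, while the filter rules still decide what is selected for the next fetch.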

On 17.10.19 20:23, Sachin Mittal wrote:


Re: Parsed segment has outlinks filtered

Sachin Mittal
Hi,
With parse.filter.urls set to false, the outlinks are no longer filtered:
I get all the outlinks for my parsed URL, so this is working as expected.
However, it has had an unwanted side effect on the FetcherThread: it now
seems to fetch all the URLs (even ones which do not match
urlfilter-regex).
These URLs were not fetched earlier. So it seems that urlfilter-regex is not
applied when the next set of URLs is generated.

I will play around with the noFilter option, as Sebastian has mentioned, and see
if this works as expected.

However, any idea why the next crawl cycle (built from the previous crawl cycle's
outlinks) does not seem to apply the URL filters defined in
urlfilter-regex?

Thanks
Sachin




RE: Parsed segment has outlinks filtered

Yossi Tamari
Hi Sachin,

If you're using the default crawl script, I think the answer was in Sebastian's email: the default seems to be to filter only in the Parse step. This has changed recently, so the Fetch step now filters as well, but only if you have the latest code. Otherwise, you need to remove the -noFilter flag from generate_args in the crawl script. I missed that, since I don't use this script.
(Generally, always treat Sebastian's answers as The Best Answers!)

        Yossi.
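
A sketch of the script edit described above, applied to a stand-in string rather than to the real file. The sample generate_args line is an assumption for illustration; the actual line in bin/crawl varies between Nutch versions.

```shell
# Stand-in for the generate_args line of the crawl script (assumed shape).
line='generate_args=($commonOptions "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments -topN $sizeFetchlist -noFilter)'

# Drop the -noFilter flag so that generate applies the URL filters again.
patched=$(printf '%s\n' "$line" | sed 's/ -noFilter//')

printf '%s\n' "$patched"

# To patch a copy of the script itself, keeping a .bak backup:
#   sed -i.bak '/generate_args=/s/ -noFilter//' bin/crawl
```

Checking the result with grep for noFilter in the edited script before the next crawl is a cheap way to confirm the flag is gone.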

-----Original Message-----
From: Sachin Mittal <[hidden email]>
Sent: Friday, 18 October 2019 17:36
To: [hidden email]
Subject: Re: Parsed segment has outlinks filtered


Re: Parsed segment has outlinks filtered

Sachin Mittal
Yes, the changes Sebastian suggested seem to be working fine.
I now see all the outlinks in the parsed document, and the subsequent crawl of
the outlinks filters out those that do not match my regex-urlfilter.

Thanks
Sachin


On Fri, Oct 18, 2019 at 11:51 PM <[hidden email]> wrote:
