Not valid URLs in Crawldb through crawlcomplete

Semyon Semyonov
Hello all,

I have launched a crawl of 100 websites with external links set to true.
After several hours, I ran the crawlcomplete command with mode set to host.

The crawlcomplete output file contains (apart from the proper host names) the following lines.

1    #Are there any places to eat onsite during the show#Are there any places to eat onsite during the show UNFETCHED
1    #Are there any points where I can access the internet at the show#Are there any points where I can access the internet at the show UNFETCHED
1    #Can I register onsite#Can I register onsite UNFETCHED
1    #Can children attend the show#Can children attend the show UNFETCHED
1    #Can you recommend any site-seeing attractions in Amsterdam#Can you recommend any site-seeing attractions in Amsterdam UNFETCHED
1    #Do I need a visa#Do I need a visa UNFETCHED
1    #How do I get to IBC2018 at the Amsterdam RAI#How do I get to IBC2018 at the Amsterdam RAI UNFETCHED
1    #Is there anywhere for me to practice my religion#Is there anywhere for me to practice my religion UNFETCHED
1    #Is there parking#Is there parking UNFETCHED
1    #Want to exhibit at IBC2018#Want to exhibit at IBC2018 UNFETCHED
1    #What do I have access to at IBC#What do I have access to at IBC UNFETCHED
1    #What do I need to bring to IBC#What do I need to bring to IBC UNFETCHED
1    #What is the IBC Big Screen Experience#What is the IBC Big Screen Experience UNFETCHED
1    #When and where is IBC#When and where is IBC UNFETCHED
1    #Who attends IBC#Who attends IBC UNFETCHED

After googling, I found the web page these entries come from:
https://show.ibc.org/about-ibc/faqs

It seems that Nutch takes the anchor name as a URL to crawl and stores it in the database with the key equal to the name.

For example:
<a class="anchor" name="Are there any places to eat onsite during the show?"></a>

Any suggestions as to what this is and how to fix it?
Thanks.

Semyon.

Re: Not valid URLs in Crawldb through crawlcomplete

Sebastian Nagel
Hi Semyon,

> It seems that Nutch takes the anchor name as a URL to crawl and stores it in the database with
> the key equal to the name.

If you look at the page's HTML, you can see that it comes from the href attribute:

     <p><a href="http://#Are there any places to eat onsite during the show?" target="_self"
title="http://#Are there any places to eat onsite during the show?">Are there any places to eat
onsite during the show?</a></p>


How are the URL filters configured?  Normally, a URL such as
"http://#Are there any places to eat onsite during the show?"
should not make it into the CrawlDb.
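
To see why such an href is not a usable URL, here is a minimal Python sketch (not Nutch code, and not the actual logic of any urlfilter plugin). It only uses the standard library's urlsplit; the helper name has_host is made up for illustration. The point is that everything after "http://" lands in the fragment, so no host is left to fetch from.

```python
from urllib.parse import urlsplit


def has_host(url: str) -> bool:
    """Hypothetical check: http(s) scheme and a non-empty host name."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.hostname)


# The href found on the FAQ page, and the page URL itself for comparison.
bad = "http://#Are there any places to eat onsite during the show?"
good = "https://show.ibc.org/about-ibc/faqs"

for url in (bad, good):
    parts = urlsplit(url)
    # For the bad URL the host part (netloc) is empty and the whole
    # question text ends up in the fragment.
    print(repr(parts.netloc), repr(parts.fragment), has_host(url))
```

A validating URL filter would be expected to reject the first URL for this kind of reason, while the second one passes.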

Best,
Sebastian


Re: Not valid URLs in Crawldb through crawlcomplete

Semyon Semyonov
Hi Sebastian,

We didn't set up any URL filters.
Could you tell me how to specify them (is it a file with URL filters plus a plugin?) and maybe advise a default filter that removes such problematic URLs?

Thanks.

Semyon.



Re: Not valid URLs in Crawldb through crawlcomplete

Sebastian Nagel
Hi,

All 8 available urlfilter-* plugins are linked from the API doc page
   https://builds.apache.org/job/nutch-trunk/javadoc/

Activate the ones you need in the property plugin.includes.

Most of the urlfilter plugins have their own configuration file, which
must be adapted to your needs.

For this specific problem, it should be enough to activate urlfilter-validator.
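
As a rough illustration of what "activate in plugin.includes" means: the property value is a regular expression over plugin names, and it is normally overridden in conf/nutch-site.xml. The Python sketch below only builds such an override. The starting value in DEFAULT_INCLUDES is an assumption (check plugin.includes in your own conf/nutch-default.xml), and the helper names add_url_filter and property_xml are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Assumed starting value; your Nutch version may ship a different default.
DEFAULT_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
    "|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def add_url_filter(includes: str, name: str) -> str:
    """Return the plugin.includes regex with urlfilter-<name> enabled too."""
    def widen(match):
        # Handles both "urlfilter-regex" and "urlfilter-(regex|domain)".
        names = match.group(1).strip("()").split("|")
        if name not in names:
            names.append(name)
        return "urlfilter-(" + "|".join(names) + ")"

    return re.sub(r"urlfilter-(\([^)]*\)|[a-z]+)", widen, includes, count=1)


def property_xml(name: str, value: str) -> str:
    """Render one <property> element in the Hadoop-style config format."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


new_includes = add_url_filter(DEFAULT_INCLUDES, "validator")
print(property_xml("plugin.includes", new_includes))
```

The printed property element would go inside the configuration element of conf/nutch-site.xml. Editing the value by hand works just as well; the script only shows which part of the expression changes.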

Best,
Sebastian


Re: Not valid URLs in Crawldb through crawlcomplete

Michael Coffey
I bet that problem affects a lot of people. It certainly has affected me.

Why isn't essential filtering ON by default?

The bin/crawl script doesn't even have a way for the operator to specify any filtering. And nowhere in the tutorial is it mentioned that you need to pass "-filter" to updatedb to make it work.



Re: Not valid URLs in Crawldb through crawlcomplete

Sebastian Nagel
> Why isn't essential filtering ON by default?

Good question. By default, only urlfilter-regex is active.
That has been the case for a long time. I think it's better not
to burden first-time users with the need to configure multiple filters.

But adding urlfilter-validator might be a good idea. Feel free to
open a Jira issue for this.

> And nowhere, in the tutorial, is it mentioned that you need to specify "-filter"
> to updatedb to make it work.

No, you don't have to. By default, filters are only applied to:
- injected URLs
- outlinks during parsing
- redirects (if the fetcher follows redirects)
It's most efficient not to filter the CrawlDb. It's costly to apply the filters
again and again: the CrawlDb might be huge (up to billions of URLs),
and/or the filter rules can be complex. The default does what is necessary but avoids
unnecessary work.


Best,
Sebastian




Re: Not valid URLs in Crawldb through crawlcomplete

Michael Coffey
OK, I filed an issue: https://issues.apache.org/jira/browse/NUTCH-2468



