Crawling with nutch, check Links

Crawling with nutch, check Links

d.kumar@technisat.de
Hey,

Currently I'm working with Nutch and Solr for our company pages.

Assuming the following situation:
We have a website:

www.mysite.lol

On this site there is a link:
www.mysite.lol/tespage/3512-1564/

As you can see there is a typo; it should be /testpage/:

www.mysite.lol/testpage/3512-1564/

As our framework doesn't care about the text before the ID, we could type anything we want and the page would still be displayed because of the ID. That is why both links are fine and there is no 404.
If I change the link on the main page to the correct one, let Nutch crawl the site again, and send it to Solr, the old one is still found.

So the link
www.mysite.lol/tespage/3512-1564/
is still in the Nutch db, because the link is valid --> no 404. But there is no main page pointing to this page anymore. How do I tell Nutch to ignore pages which no longer have any links pointing to them?
Basically --> revalidating links and removing pages without any links to them?



Kind regards
David Kumar

Senior Software Engineer Java, B. Sc.
Project Manager PIM
Infotech Department
TechniSat Digital GmbH
Julius-Saxler-Straße 3
TechniPark
D-54550 Daun / Germany

Tel.: + 49 (0) 6592 / 712 -2826
Fax: + 49 (0) 6592 / 712 -2829

www.technisat.com/de_DE/
www.facebook.com/technisat


Re: Crawling with nutch, check Links

Sebastian Nagel
Hi David,

The easiest way is to delete the CrawlDb and start the crawl from scratch.
Since it's a single-site crawl this should be possible, at least from time to time.
Then delete the documents from the index which haven't been updated.
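
A minimal sketch of that last step, assuming the index uses Nutch's default schema with a "tstamp" field and a Solr core named "nutch" (core name, URL and cut-off time are just placeholders to adapt to your setup):

    # Sketch: after the fresh crawl has been indexed, delete every document
    # whose "tstamp" predates the new crawl - those pages were not refetched,
    # so nothing links to them anymore. Core name, URL and cut-off are
    # assumptions; adapt them to your setup.
    import requests

    SOLR_UPDATE_URL = "http://localhost:8983/solr/nutch/update"  # hypothetical core
    CRAWL_STARTED = "2017-07-28T00:00:00Z"                       # start of the fresh crawl

    resp = requests.post(
        SOLR_UPDATE_URL,
        params={"commit": "true"},
        json={"delete": {"query": "tstamp:[* TO %s]" % CRAWL_STARTED}},
    )
    resp.raise_for_status()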

A more sophisticated solution is not yet ready, see
  https://issues.apache.org/jira/browse/NUTCH-1932

Best,
Sebastian


AW: Crawling with nutch, check Links

d.kumar@technisat.de
Hey Sebastian,


Thanks. What I have done so far is delete the database and start a whole new crawl.
I had seen that Jira issue about orphaned pages before. That is exactly what I'm looking for; as the ticket is more than two years old, I assume it won't be fixed... :-(
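
For the record, here is roughly how I script the wipe-and-recrawl cycle; the paths, the Solr core and the bin/crawl options are from my local setup and may differ for other Nutch versions:

    # Sketch: drop the old crawl data and run a fresh crawl that indexes to Solr.
    # Directory layout, core name and bin/crawl options are assumptions; check
    # the crawl script options of your Nutch version.
    import shutil
    import subprocess
    from pathlib import Path

    CRAWL_DIR = Path("crawl")                      # holds crawldb, linkdb, segments
    SEED_DIR = "urls"                              # directory with the seed list
    SOLR_URL = "http://localhost:8983/solr/nutch"  # hypothetical core

    # Start from scratch: remove the CrawlDb (and segments/linkdb with it).
    if CRAWL_DIR.exists():
        shutil.rmtree(CRAWL_DIR)

    # Re-run the standard crawl script for a couple of rounds and index to Solr.
    subprocess.run(
        ["bin/crawl", "-i",
         "-D", "solr.server.url=" + SOLR_URL,
         SEED_DIR, str(CRAWL_DIR), "2"],
        check=True,
    )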

Thanks

David




Re: AW: Crawling with nutch, check Links

Sebastian Nagel
> as the ticket is more than two years old, I assume it won't be fixed... :-(

Not necessarily. Other features got in after more than two years ;)
