NOINDEX, NOFOLLOW


NOINDEX, NOFOLLOW

a a

hi,

I have a page with <meta name="robots" content="noindex,nofollow" />. I know that Nutch obeys this tag, because I don't find the page's content or title in my index, but I expected the document not to be in the index at all. Why does Nutch keep the document in my index with no title and no content?

I'm using the index-basic and index-more plugins, and I want to understand why Nutch still fills in the url, date, boost, etc., when it didn't do so for the title and content.

I thought that if Nutch obeys nofollow and noindex, it would skip the whole document!

Or maybe I misunderstood something. Can you please explain this behavior to me?

best regards.

     

Re: NOINDEX, NOFOLLOW

Kirby Bohling-2
On Thu, Dec 10, 2009 at 12:22 PM, BELLINI ADAM <[hidden email]> wrote:

> <...Snip...>

My guess is that the page is recorded to note that the page shouldn't
be fetched; I'm guessing the status is one of the magic values.  It
probably re-fetches the page periodically to ensure it has the list.
So the URL and the date make sense to me as to why they populate them.
I don't know why it is computing the boost, other than the fact that
it might be part of the OPIC scoring algorithm.  If the scoring
algorithm ever uses the scores/boost of the pages that you point at as
a contributing factor, it would make total sense.  So even though it
doesn't index "http://example/foo/bar", knowing which pages point
there, and what their scores are, could contribute to the scores of
pages that you do index and that contain an outlink to that page.

Kirby

RE: NOINDEX, NOFOLLOW

a a

hi,

Thanks for this information. But I'm using a Solr index, and when I run a search I get a blank result.
For example, if I get 10 documents as a search result, 9 will be fine (I display the title and the first 4 lines of content), but I get one blank result because of this page (with no content and no title)! I don't understand why it is in the index when it was set as noindex.

Here is an example:

Searching for word1:

results:

1- title 1 : content1
2- title 1 : content2
3- title 1 : content3
4- title 1 : content4
5- title 1 : content5
6- title 1 : content6
7- title 1 : content7
8- title 1 : content8
9-    ....BLANK......
10- title 1 : content10






Re: NOINDEX, NOFOLLOW

Andrzej Białecki-2
In reply to this post by Kirby Bohling-2
On 2009-12-10 20:33, Kirby Bohling wrote:

> <...Snip...>

Very good explanation, those are exactly the reasons why Nutch never
discards such pages. If you really want to ignore certain pages, then
use URLFilters and/or ScoringFilters.
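
To illustrate the URL-filter idea, here is a minimal sketch in Python. It is not Nutch code (Nutch's filters are Java plugins); it only models the contract as I understand it: a filter receives a URL and returns either the URL (accepted) or nothing (rejected). The rule list, the `filter_url` name, and the "reject when no rule matches" default are assumptions made for this example, written in the `+regex` / `-regex` style used by Nutch's regex URL filter configuration.

```python
import re
from typing import List, Optional

# Hypothetical rules for this sketch: '-' rejects, '+' accepts,
# and the first rule whose pattern matches decides the outcome.
RULES: List[str] = [
    "-^http://example/foo/bar",  # the noindex page we want ignored entirely
    "+.",                        # accept every other URL
]


def filter_url(url: str, rules: Optional[List[str]] = None) -> Optional[str]:
    """Return the URL if it is accepted, or None if it is rejected."""
    for rule in (RULES if rules is None else rules):
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return url if sign == "+" else None
    # Assumption for this sketch: a URL that matches no rule is rejected.
    return None


if __name__ == "__main__":
    for candidate in ("http://example/foo/bar", "http://example/index.html"):
        print(candidate, "->", filter_url(candidate))
```

Note that a filter like this keeps the URL out of the crawl altogether, which is stronger than noindex: the page is never fetched, so it cannot contribute link or score information either.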


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: NOINDEX, NOFOLLOW

Kirby Bohling-2
In reply to this post by a a
On Thu, Dec 10, 2009 at 2:55 PM, BELLINI ADAM <[hidden email]> wrote:
>
> hi,
>
> Thanks for this information. But I'm using a Solr index, and when I run a search I get a blank result.
> For example, if I get 10 documents as a search result, 9 will be fine (I display the title and the first 4 lines of content), but I get one blank result because of this page (with no content and no title)! I don't understand why it is in the index when it was set as noindex.
>

I've never used the Solr integration, so I'm unable to help you.  This
sounds like a bug to me, but I'm not sure.  Hopefully one of the Solr
users will help us out and let you know what they think.

Thanks,
   Kirby
<...Snip...>

RE: NOINDEX, NOFOLLOW

a a


hi,

I have a custom plugin that parses and indexes DC (Dublin Core) meta tags, and it was filling in dc.description and dc.keywords. Since my Solr search also covers description and keywords, and I display the title and the first 4 lines of content, the noindex page showed up in the results (as a blank entry without title or content) because it has a description and keywords.
Now I will add some code to my plugin to avoid adding description and keywords to pages that are marked noindex, and that should resolve my problem.

Thanks to all for your help.
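
The check described above could look roughly like the following. This is only a sketch of the logic in Python, not the actual plugin (a real Nutch indexing plugin would be written in Java); the function names and the `description` / `keywords` field names are hypothetical, and the document is modeled as a plain dict.

```python
from typing import Dict, Set


def robots_directives(content: str) -> Set[str]:
    """Split a robots meta 'content' value into lower-cased directives."""
    return {part.strip().lower() for part in content.split(",") if part.strip()}


def add_dc_fields(doc: Dict[str, str], robots_content: str,
                  description: str, keywords: str) -> Dict[str, str]:
    """Add the DC fields only when the page is not marked noindex."""
    directives = robots_directives(robots_content)
    # "none" is shorthand for "noindex,nofollow" in the robots meta tag.
    if "noindex" in directives or "none" in directives:
        return doc  # skip the searchable DC fields for this page
    doc["description"] = description
    doc["keywords"] = keywords
    return doc


if __name__ == "__main__":
    page = {"url": "http://example/foo/bar"}
    print(add_dc_fields(page, "noindex,nofollow", "some description", "kw1, kw2"))
```

With this guard the noindex page still has its url, date, and boost recorded, but a search over description and keywords no longer matches it, so the blank entry should not appear in the results.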


