robot exclusion for portions of a document


robot exclusion for portions of a document

Alexander E Genaud
Hello,

As far as I understand, /robots.txt designates which files may and may
not be indexed by Nutch and other crawlers. However, is there a
method by which a site may exclude only sections of a document?

The benefit is most evident in the search hit result descriptions
(snippets), which will often contain navigation links that give no
useful information about a page. As far as I know, there is no
standard. Does Nutch provide a method for document section exclusion?
Some methods I've seen include:



<!-- robots content="none" -->

not to be indexed

<!-- /robots -->



<!-- FreeFind Begin No Index -->

not to be indexed

<!-- FreeFind End No Index -->
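
If such markers were supported, the stripping itself could be simple.
For illustration, here is a rough sketch (hypothetical code, not
anything in Nutch today) that cuts the two comment-delimited regions
above out of the raw HTML with Java regexes:

import java.util.regex.Pattern;

public class SectionExclusion {

  // Matches everything from the start marker to the end marker,
  // including the markers themselves; DOTALL lets '.' span newlines,
  // and the reluctant '.*?' stops at the nearest end marker.
  private static final Pattern ROBOTS_NONE = Pattern.compile(
      "<!--\\s*robots\\s+content=\"none\"\\s*-->.*?<!--\\s*/robots\\s*-->",
      Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

  private static final Pattern FREEFIND = Pattern.compile(
      "<!--\\s*FreeFind Begin No Index\\s*-->.*?"
          + "<!--\\s*FreeFind End No Index\\s*-->",
      Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

  /** Returns the HTML with all excluded sections cut out. */
  public static String stripExcluded(String html) {
    html = ROBOTS_NONE.matcher(html).replaceAll("");
    return FREEFIND.matcher(html).replaceAll("");
  }
}

Running this before parsing would keep the excluded text out of both
the index and the snippets.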


If there is no such feature and this is deemed useful, I would be
willing to implement this feature in code.

Alex
--
CCC7 D19D D107 F079 2F3D BF97 8443 DB5A 6DB8 9CE1

Re: robot exclusion for portions of a document

Jérôme Charron
> As far as I understand, /robots.txt designates which files may and may
> not be indexed by Nutch and other crawlers. However, is there a
> method by which a site may exclude only sections of a document?
> Some methods I've seen include:
> <!-- robots content="none" -->
> <!-- FreeFind Begin No Index -->
> If there is no such feature and this is deemed useful, I would be
> willing to implement this feature in code.

I think it would be interesting to have such a feature. I don't know
how much it is used in online documents, but for intranet crawling it
could be useful.

But since there is no specification for this, you should probably
support the most widely used markers (a rough sketch follows the list):
* <!-- robots content="none" -->
* <noindex>
* <!-- googleon ... -->  <!-- googleoff ... -->
* .....
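
Since the HTML parser already builds a DOM, another option (again just
a sketch, nothing that exists in Nutch) would be to prune marked
regions while walking the tree:

import org.w3c.dom.Comment;
import org.w3c.dom.Node;

public class ExclusionPruner {

  /**
   * Walks the children of a node in order and removes every sibling
   * that appears between a start marker comment and its end marker.
   */
  public static void prune(Node parent) {
    boolean skipping = false;
    Node child = parent.getFirstChild();
    while (child != null) {
      Node next = child.getNextSibling();   // grab before removal
      if (child instanceof Comment) {
        String c = child.getNodeValue().trim();
        if (c.startsWith("robots content=\"none\"")
            || c.startsWith("googleoff:")) {
          skipping = true;
        } else if (c.startsWith("/robots") || c.startsWith("googleon:")) {
          skipping = false;
        }
      } else if (skipping) {
        parent.removeChild(child);
      } else {
        prune(child);                       // recurse into kept subtrees
      }
      child = next;
    }
  }
}

Note that this simple version only pairs markers that are siblings in
the DOM; markers split across elements would need a document-order walk.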

My 2 cents.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: robot exclusion for portions of a document

Alexander E Genaud
Thanks for getting back to me, Jérôme.

Would you suggest I jump into the Tokenizer? Would we need to
differentiate indexing, summaries, and/or anchors (as Google claims to
do)? Should I target 0.7.2 or 0.8-dev?
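
If we do differentiate them, the type after the colon in the
google-style markers could be tracked with a small state holder along
these lines (only a sketch, assuming the types index, anchor, snippet,
and all):

import java.util.EnumSet;
import java.util.Set;

public class RobotsSections {

  enum Section { INDEX, ANCHOR, SNIPPET }

  // Sections currently switched off; "all" toggles every one.
  private final Set<Section> off = EnumSet.noneOf(Section.class);

  /** Feed every HTML comment seen while scanning the page to this. */
  void onComment(String comment) {
    String c = comment.trim().toLowerCase();
    boolean start = c.startsWith("googleoff:");
    if (!start && !c.startsWith("googleon:")) return;
    String type = c.substring(c.indexOf(':') + 1).trim();
    Set<Section> affected;
    if (type.equals("all")) {
      affected = EnumSet.allOf(Section.class);
    } else {
      try {
        affected = EnumSet.of(Section.valueOf(type.toUpperCase()));
      } catch (IllegalArgumentException unknownType) {
        return;                 // ignore types we do not understand
      }
    }
    if (start) off.addAll(affected); else off.removeAll(affected);
  }

  boolean index()   { return !off.contains(Section.INDEX); }
  boolean anchor()  { return !off.contains(Section.ANCHOR); }
  boolean snippet() { return !off.contains(Section.SNIPPET); }
}

The indexer, the anchor extractor, and the summarizer would each
consult their own flag while consuming the token stream.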

As an aside, perhaps we should add a modified-date field (as NutchWax
and others do).

Alex

>> But since there is no specification for this, you should probably
>> support the most widely used markers:
>> * <!-- robots content="none" -->
>> * <noindex>
>> * <!-- googleon ... -->  <!-- googleoff ... -->

--
55.67N 12.588E
CCC7 D19D D107 F079 2F3D BF97 8443 DB5A 6DB8 9CE1

Re: robot exclusion for portions of a document

Nutch Newbie
On 5/16/06, Alexander E Genaud <[hidden email]> wrote:

> Hello,
>
> As far as I understand, /robots.txt designates which files may and may
> not be indexed by Nutch and other crawlers. However, is there a
> method by which a site may exclude only sections of a document?
>
> The benefit is most evident in the search hit result descriptions
> (snippets), which will often contain navigation links that give no
> useful information about a page. As far as I know, there is no
> standard. Does Nutch provide a method for document section exclusion?
> Some methods I've seen include:

It would be cool if one could remove the menu-type navigation content
from the summary snippets. How about removing all <a href=...>yyy</a>
content from the summary? The summary already strips out all HTML tags,
but wouldn't it be a good idea to give <a href> tags extra care, i.e.
to remove the content in between as well? Would it be a good idea to
apply the same principle to JavaScript too? I am very curious whether
there are use cases where this practice brings more minuses than pluses.

Off the top of my head: if <a href=http://abc.co.uk>http://abc.co.uk/</a>,
where the link text is itself a URL, we would lose some info from the
summary. Still, that is much better than having menus/JavaScript etc.
in the summary, which give no value to the user at all.

Any ideas how one could go about fixing the summary problem?
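
As a starting point, here is a rough sketch (hypothetical code, not
anything in Nutch today) that drops anchor text before the summary is
built, but keeps link text that is itself a URL:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SummaryFilter {

  // group(1) captures the link text of an <a ...>text</a> element.
  private static final Pattern ANCHOR = Pattern.compile(
      "<a\\b[^>]*>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

  /**
   * Removes anchor text before summarization, except when the text
   * looks like a URL and may itself carry useful information.
   */
  public static String dropNavigationText(String html) {
    Matcher m = ANCHOR.matcher(html);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
      String text = m.group(1);
      m.appendReplacement(sb,
          text.startsWith("http") ? Matcher.quoteReplacement(text) : "");
    }
    m.appendTail(sb);
    return sb.toString();
  }
}

The startsWith("http") test is crude; a real implementation might
compare the link text against the href value instead.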


Re: robot exclusion for portions of a document

juan_barbancho_rsi
Hello,

I propose an idea: you could use a special tag, like a meta tag, in the
body. Such a tag is not shown by HTML browsers and does not need an
HTML comment.

<html>
<body>
HELLO

<meta name="robots" content="noindex">
      <p>
      HELLO NO INDEX
      </p>
</meta>

</body>
</html>


Re[2]: robot exclusion for portions of a document

Eugen Kochuev
Hello juan,

Thursday, May 18, 2006, 10:18:36 AM, you wrote:

I don't think that such usage of the HTML meta tag is a good idea: it
would lead to invalid HTML code.

The Google AdSense bot uses paired HTML comments (<!-- ... --> ...
<!-- ... -->), if present, to determine which content to use for
targeting. Nutch could use the same approach; however, it would not be
part of any standard.


> Hello,
>
> I propose an idea: you could use a special tag, like a meta tag, in the
> body. Such a tag is not shown by HTML browsers and does not need an
> HTML comment.

--
Best regards,
 Eugen                            mailto:[hidden email]