caching - filetypes

caching - filetypes

sshingler
Hi all,

I'm trying to find out which file types Nutch will cache.

For example, it caches HTML but not PDF.

Is there any documentation on how different file types are handled?

Is it possible to configure Nutch to cache PDFs and other formats?

Any advice very gratefully received.
Thanks,
Steve
Re: caching - filetypes

Ernesto De Santis-4
Hi Steven

I'm not sure I fully understand your email.
What do you mean by "cache"?

If you want to crawl PDFs, you need to remove the URL filter that excludes them.

In your crawl-urlfilter.txt there is a line that starts with a minus sign,
followed by a list of file extensions. Delete the pdf extension from that list.
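
To illustrate how such a rule behaves, here is a minimal Python sketch (not Nutch code). It assumes the filter works the way the comments in the filter file describe: each rule is a '+' or '-' followed by a regular expression, and the first rule that matches a URL decides whether it is kept. The extension lists and the `accepts` helper are made up for the example and are not copied from any shipped configuration.

```python
import re

# Illustrative rules in the style of crawl-urlfilter.txt. The extension
# lists are assumptions for this example, not a real Nutch default.
RULES_WITH_PDF = [
    r"-\.(gif|jpg|png|css|zip|exe|pdf)$",  # skip URLs with these suffixes
    r"+.",                                 # accept everything else
]
RULES_WITHOUT_PDF = [
    r"-\.(gif|jpg|png|css|zip|exe)$",      # same rule with pdf removed
    r"+.",
]


def accepts(url, rules):
    """Return True if the first rule matching `url` starts with '+'.

    A URL that matches no rule at all is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


pdf_url = "http://www.lds.org/newsroom/files/jeff_lindsay_DNA_3.pdf"
print(accepts(pdf_url, RULES_WITH_PDF))     # False: the minus rule matches
print(accepts(pdf_url, RULES_WITHOUT_PDF))  # True: falls through to "+."
```

The point is only that the PDF URL is dropped while "pdf" is in the minus list and accepted once it is removed; whether the document is then parsed is a separate question.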

Good luck
Ernesto.
P.S. I'm a Nutch beginner, but since nobody else had replied, I thought I
would try to help.


Re: caching - filetypes

Jacob Brunson
> I'm not sure I fully understand your email.
> What do you mean by "cache"?

On the standard search results page, each hit has a link to a cached
copy of the page. If the page was HTML, there is no problem; if the
page was binary, however, the link returns an HTTP 500 Internal
Server Error.

You can see this by clicking the "cached" link of any of the PDF
documents in the search results on my search engine:
http://ldssearch.com/search.jsp?lang=en&query=pdf


--
http://JacobBrunson.com
Re: caching - filetypes

Alvaro Cabrerizo
Hi,
Looking at your website, I can see two different kinds of results:

- The first hit, for example,
http://www.lds.org/newsroom/files/jeff_lindsay_DNA_3.pdf, has no summary and
shows the problem with the cached link.

- The third hit belongs to the second group: these hits have a summary, and
the cached link works fine.

So it looks like Nutch can't access the content of the hits in the first
group. Maybe the parse-pdf plugin can't handle this PDF. That can happen, and
it would also explain why the title of each first-group hit is the URL rather
than the title stored inside the PDF document.

If I were you, I would crawl only the first hit
(http://www.lds.org/newsroom/files/jeff_lindsay_DNA_3.pdf) and look at the log
file. If parse-pdf can't handle this document, you will see a big ERROR
message.
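
If the log is long, a small script can pull out the interesting lines. This is only a rough Python sketch: the `scan_log` helper is hypothetical, the log path is whatever your installation writes to, and it assumes nothing about the log format beyond the word ERROR appearing on the relevant lines.

```python
import sys


def scan_log(lines, url):
    """Return (line_number, text) pairs for lines that contain the
    marker ERROR or that mention the given URL."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if "ERROR" in line or url in line:
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python scan_log.py <path-to-log> <url>
    log_path, url = sys.argv[1], sys.argv[2]
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for number, text in scan_log(handle, url):
            print(f"{number}: {text}")
```

Run it against the log from a crawl of just that one PDF, passing the PDF's URL as the second argument, so that anything it prints is likely to be about that document.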

Hope it helps.

Alvaro C.
