Nutch and fileparsers.

Nutch and fileparsers.

noneone
CONTENTS DELETED
The author has deleted this message.

Re: Nutch and fileparsers.

Markus N.
Hi,

I don't know how to solve your PDF problem with Nutch, but a simple workaround might be to append "#search=searchterm(s)" to the URL of the PDF (e.g. when searching for "test": http://xyz.com/foundPDF#search=test). This opens the search box in Acrobat Reader. It doesn't work very well, but it is better than nothing and a very quick patch...
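
As a rough sketch of how a results page could build such a link: the helper name pdf_search_link is made up for illustration and is not part of Nutch. I am also assuming the search terms should be percent-encoded; how a given PDF viewer treats the encoded fragment is something to check.

```python
from urllib.parse import quote


def pdf_search_link(pdf_url: str, query: str) -> str:
    """Return pdf_url with a '#search=' fragment for the given query.

    Hypothetical helper, not a Nutch API.
    """
    # Drop any fragment already on the URL so we do not end up with two.
    base = pdf_url.split("#", 1)[0]
    # Percent-encode the query (assumption: the viewer decodes it again).
    return f"{base}#search={quote(query)}"


print(pdf_search_link("http://xyz.com/foundPDF", "test"))
# prints http://xyz.com/foundPDF#search=test
```

The only thing the search front end has to know is the URL of the hit and the user's query, so nothing has to change on the indexing side.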

Markus

Gilbert Groenendijk wrote
Hi,

I currently have two questions about the file format parsers. First, I
would like to know how the PDF parser handles PDF files. Is it possible to
split a PDF page by page, so that if you find a match on a specific page,
you can go to the matched page with something like #page=12? The second
question is about content 'filtering'. What happens if I index a PowerPoint
with the header 'CompanyName Presentation'? Basically the word
'Presentation' is irrelevant but the company name isn't. The header is on
every page, which gives me 'garbage' in the index. Does anyone have any
thoughts about this? Thanks in advance.

--
Gilbert

RE: Nutch and fileparsers.

Alan Tanaman
In reply to this post by noneone
Gilbert,

Regarding splitting documents up, might I suggest you take a look at a
couple of the threads (and all the responses) on the dev mailing lists?
http://www.mail-archive.com/nutch-dev@.../msg05412.html
http://www.mail-archive.com/nutch-dev@.../msg05374.html

Although these refer to the RSS parser, you could do something similar with
PDF or any other parser that produces documents that are to be split and
indexed as separate documents.  It would seem, however, to require a fair
number of changes to the Nutch code.
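
To make the idea concrete, here is a minimal sketch of what "one document per page" could look like. It is not Nutch code: the function name split_into_page_docs and the record fields are hypothetical, and it assumes the per-page text has already been extracted from the PDF by some other tool.

```python
def split_into_page_docs(pdf_url, page_texts):
    """Build one index record per page of a PDF.

    pdf_url    -- URL of the whole PDF
    page_texts -- list of strings, one per page, already extracted
                  (assumption: extraction happens elsewhere)
    """
    docs = []
    for number, text in enumerate(page_texts, start=1):
        docs.append({
            # Each page gets its own address via the '#page=N' fragment.
            "url": f"{pdf_url}#page={number}",
            "page": number,
            "content": text,
        })
    return docs


pages = ["Introduction ...", "Results ..."]
for doc in split_into_page_docs("http://xyz.com/foundPDF", pages):
    print(doc["url"])
```

A hit on the second record would then link straight to http://xyz.com/foundPDF#page=2, which is the behaviour Gilbert asked about.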

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com

-----Original Message-----
From: Gilbert Groenendijk [mailto:[hidden email]]
Sent: 07 February 2007 09:53
To: [hidden email]
Subject: Nutch and fileparsers.
