How to index and return files names ?

How to index and return files names ?

Arnaud Goupil
Hi,

I would like Nutch to return results when the search terms
appear in the names of files known to the index.

For example, my HTTP location indexed by Nutch
contains various files, named:


computer security.pdf
unix shell.pdf
motherboard specifications.pdf


If I search for "motherboard", I want Nutch to return a
result pointing to my third document, even if the
document itself does not contain the word "motherboard",
simply because the word is in the file name.

Is there a way to do this?

Thanks


RE: How to index and return files names ?

Alan Tanaman
Arnaud,

Absolutely. Out of the box, the url field in Nutch is searchable (and tokenized).
You restrict the search to a specific field with a colon prefix, for example by
typing

url:motherboard or url:"unix shell"

The default search field (when no field prefix is specified) is content.

Generally the Lucene search syntax is supported (although I believe there
are Nutch-specific issues):
http://lucene.apache.org/java/docs/queryparsersyntax.html

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
Tel: +44 (20) 7257 6125
Mobile: +44 (7796) 932 362
http://blog.idna-solutions.com
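
To make it concrete why url:motherboard can match a document whose body never mentions the word, here is a minimal sketch of the idea behind a tokenized url field. It is not Nutch's actual analyzer: the splitting rule (break on anything non-alphanumeric), the function names, and the example.com URLs are all assumptions made for illustration.

```python
import re
from urllib.parse import unquote


def url_tokens(url):
    """Split a URL into lowercase alphanumeric terms.

    Illustrative only: this is NOT Nutch's real URL tokenizer.
    """
    decoded = unquote(url)  # turn %20 back into a space
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", decoded) if t]


def matches_term(url, term):
    """Rough analogue of url:term, true if the term is one of the URL's tokens."""
    return term.lower() in url_tokens(url)


def matches_phrase(url, phrase):
    """Rough analogue of url:"a b", true if the terms appear consecutively."""
    tokens = url_tokens(url)
    wanted = phrase.lower().split()
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))


# Hypothetical URLs for the three files named in the question.
BASE = "http://example.com/files/"
urls = [BASE + name for name in (
    "computer%20security.pdf",
    "unix%20shell.pdf",
    "motherboard%20specifications.pdf",
)]

term_hits = [u for u in urls if matches_term(u, "motherboard")]
phrase_hits = [u for u in urls if matches_phrase(u, "unix shell")]
print(term_hits)
print(phrase_hits)
```

Because the file name is part of the URL, its words end up as terms of the url field, so a field-restricted query can find the document by name alone.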


Re: How to index and return files names ?

Enis Soztutar
As Alan suggested, you should search the url field. To search the
url field explicitly, you need to include the query-url plugin. Note that
query-basic also queries the url field, even without the url: prefix in the query.
I also suggest using the URLTokenizer from
http://issues.apache.org/jira/browse/NUTCH-389, which tokenizes URLs
better.




Re: How to index and return files names ?

obrienk

Hi Guys,

The problem I've found with the url: field is that if you search for a Word document with
url:doc, it will return not only foo.doc but also things like /doc/text.html.

So is there an easy way to search by file type? I don't believe it's indexed out of the box, but with it Arnaud could do searches such as:

filetype:pdf motherboard

Regards,
Karl.
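
The ambiguity described above can be reproduced with a small sketch. The tokenizing rule, the helper names, and the example.com URLs are assumptions for illustration only and are not how Nutch is implemented. The sketch contrasts matching "doc" as any URL term with matching it only as the file extension, which is what a dedicated file-type field would need.

```python
import re
from posixpath import splitext
from urllib.parse import urlparse


def url_terms(url):
    """Naive URL tokenization: split on non-alphanumerics (illustrative only)."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", url) if t]


def file_extension(url):
    """Return the lowercase extension of the URL's path, without the dot."""
    path = urlparse(url).path
    return splitext(path)[1].lstrip(".").lower()


# Hypothetical URLs mirroring the two cases in the post.
urls = [
    "http://example.com/files/foo.doc",
    "http://example.com/doc/text.html",
]

# Term match: "doc" is a token of BOTH URLs, so both are returned.
by_term = [u for u in urls if "doc" in url_terms(u)]

# Extension match: only the first URL actually ends in .doc.
by_extension = [u for u in urls if file_extension(u) == "doc"]

print(by_term)
print(by_extension)
```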



Re: How to index and return files names ?

Brian Whitman

On Jan 10, 2007, at 8:48 AM, obrienk wrote:

>
> So is there an easy way to search on file type?  I don't believe it's
> indexed out of the box, but that way Arnaud could do searches such as:
>
> filetype:pdf motherboard


If you enable the query-more and index-more plugins in your
nutch-site.xml, you can search by type:(mime-type), as in your example.

-Brian
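
As a rough illustration of what a type: query works against, the sketch below derives MIME-type terms for a URL using Python's standard mimetypes module. It only guesses from the file extension. The assumption that the full type and its two halves are each usable as search terms is mine and is not confirmed in this thread, and the function name and URL are hypothetical.

```python
import mimetypes


def type_terms(url):
    """Guess a MIME type from the URL's extension and split it into terms.

    Assumption for illustration: the full type, the primary type and the
    subtype could each serve as a value for a type: query.
    """
    mime, _encoding = mimetypes.guess_type(url)
    if mime is None:
        return []
    primary, _, sub = mime.partition("/")
    return [mime, primary, sub]


# Hypothetical URL for the third file from the original question.
terms = type_terms("http://example.com/files/motherboard%20specifications.pdf")
print(terms)
```

Unlike url:doc, a query on such a field would not be confused by a directory that happens to be named after a file extension, since the type comes from the document rather than from arbitrary URL terms.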




RE : RE: How to index and return files names ?

Arnaud Goupil
In reply to this post by Alan Tanaman
Thanks Alan (and everyone who answered),

I will look at this solution.

Arnaud
