storing the document URI in the index

Ard Schrijvers
Hello,

Is it possible to configure Solr to store the document URI in the Lucene index (the URI is not an XML field, just the document's location)? Or is everybody used to storing the contents of a document in the Lucene index (doesn't this imply a much larger index, though?), so that instead of retrieving the document's content through a separate fetch over HTTP or the filesystem you just show the result from the stored content field?

Thx in advance for any help,

Regards Ard





Re: storing the document URI in the index

Erik Hatcher

On Jun 12, 2007, at 8:51 AM, Ard Schrijvers wrote:
> is it possible to configure solr to store the document URI in the  
> lucene index (the URI is not an xml field, but just the document's  
> location)?

Yes.  Set the field to be stored and non-indexed; field type "string"
is what I use.

> Or is everybody used to storing the contents of a document in the  
> lucene index (doesn't this imply a much larger index though?), so  
> instead of retrieving the document's content through a separate
> fetch over http/filesystem just show the result from the stored  
> content field?

This all depends on the needs of your project.  It's perfectly fine to
store the text outside of the index, and that is the way it really  
has to be done for very large indexes where as few fields as possible  
are "stored".

If you're also asking about Solr fetching the remote resource, that  
is a different story altogether, and no it does not do that.  [though  
with the streaming capability you can feed in a document entirely  
from a URL, but I haven't experimented with that feature yet myself]

        Erik
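
As an illustration of the field described above, here is a minimal Python sketch that generates the matching schema.xml declaration. The field name `uri` and the helper `uri_field_definition` are assumptions made for this example; the `string` type and the stored, non-indexed settings follow the reply. Pass `indexed=True` if the URI should also be searchable as an exact value.

```python
import xml.etree.ElementTree as ET


def uri_field_definition(name="uri", indexed=False):
    """Return a schema.xml <field> declaration for a stored URI field.

    The field name is a hypothetical choice. "string" is assumed to be
    the untokenized type from the example schema, so the value is kept
    verbatim rather than analyzed.
    """
    field = ET.Element("field", {
        "name": name,
        "type": "string",
        "indexed": "true" if indexed else "false",
        "stored": "true",
    })
    return ET.tostring(field, encoding="unicode")


# Print the declaration so it can be pasted into the <fields> section.
print(uri_field_definition())
```

A stored, non-indexed field costs only the space of the URI itself, which fits the advice to keep as few stored fields as possible in large indexes.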


RE: storing the document URI in the index

Ard Schrijvers
Hello Erik,

Thanks for the fast answer (sorry that my mail does not indent the quoted text, but I have to use webmail :-( ). The problem I am facing is that I do not see Solr storing the location of the documents it indexed. So I need to store the location of a document in a field, but I do not see where Solr would do this. Fetching the document will be done with the simple Cocoon generator, so that is no problem, but of course I need the URL/URI to be in the index. I know I need it as an UN_TOKENIZED STORED field, but I can see with Luke that the location is not present in the Lucene index when Solr "crawls" a directory with XML files.

Regards Ard Schrijvers



Re: storing the document URI in the index

Otis Gospodnetic-2
In reply to this post by Ard Schrijvers
Ard,

You have to store the URI in a Field yourself.  That means you need to define that field in the schema and you have to set its value when adding documents.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
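
A minimal Python sketch of the second half of that advice, setting the URI value when the document is added. The field names (`id`, `uri`, `title`), the file path, and the helper `add_message` are all hypothetical; the point is only that the indexing client, which knows where the source lives, writes that location into the add message itself.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath


def add_message(doc_id, uri, title):
    """Build a Solr <add> message whose document carries its own URI."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("id", doc_id), ("uri", uri), ("title", title)):
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    return ET.tostring(add, encoding="unicode")


# Hypothetical source file; the client converts its path to a file: URI.
source = PurePosixPath("/data/docs/report-1.xml")
message = add_message("report-1", source.as_uri(), "Quarterly report")
print(message)
```

Each field named here still has to be declared in the schema, otherwise the add will be rejected.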


RE: storing the document URI in the index

Ard Schrijvers
Hello Otis,

Thanks for the info. Would it be an improvement to be able to specify in schema.xml whether or not the URI should be stored, in a field whose name you can also specify in the schema? It may well be that you do not "own" the XML documents you index over HTTP and, at the same time, do not want to store their contents in the index. Since the URI is known at indexing time, adding it to the index is trivial.

Regards Ard





Re: storing the document URI in the index

Otis Gospodnetic-2
In reply to this post by Ard Schrijvers
I'm afraid I don't understand your question.  Perhaps somebody else does.

Otis


Re: storing the document URI in the index

Yonik Seeley-2
In reply to this post by Ard Schrijvers
On 6/12/07, Ard Schrijvers <[hidden email]> wrote:
> Thanks for the info. Would it be an improvement to be able to specify in schema.xml whether or not the URI should be stored, in a field whose name you can also specify in the schema? It may well be that you do not "own" the XML documents you index over HTTP and, at the same time, do not want to store their contents in the index. Since the URI is known at indexing time, adding it to the index is trivial.


Think of it a different way... Solr isn't indexing XML documents; it's
simply using XML as a serialization format to pass in the data to
index.  Often, a program is written to read some other data source
(like a database) and send an XML message to Solr to index it (and
hence the XML document only exists for a very brief time).

-Yonik
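
To make that workflow concrete, here is a small Python sketch, assuming an `articles` table with `id`, `title`, and `body` columns (an in-memory SQLite database stands in for the real data source, and all names are hypothetical). The XML add message is built from the rows and exists only as a string in memory, so there is no document location for Solr to record.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_add_message(rows):
    """Serialize (id, title, body) rows as a single Solr <add> message."""
    add = ET.Element("add")
    for row_id, title, body in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in (("id", str(row_id)), ("title", title), ("body", body)):
            ET.SubElement(doc, "field", {"name": name}).text = value
    return ET.tostring(add, encoding="unicode")


# Hypothetical data source: an in-memory table standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles (title, body) VALUES (?, ?)",
    [("First", "Hello Solr"), ("Second", "Stored versus indexed")],
)
rows = conn.execute("SELECT id, title, body FROM articles ORDER BY id").fetchall()
conn.close()

# The XML exists only as this transient string before it is sent to Solr.
message = rows_to_add_message(rows)
print(message)
```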

Re: storing the document URI in the index

Walter Underwood, Netflix
In reply to this post by Ard Schrijvers
Solr doesn't have the URL of the document. The document is given
to Solr in an HTTP POST.

Solr is not a web spider, it is a search web service.

wunder
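
A short Python sketch of what "given to Solr in an HTTP POST" looks like from the client side. The endpoint `http://localhost:8983/solr/update` is an assumption (a local example installation), as are the field names and the helper `build_update_request`. The request is only constructed here, not sent, so the sketch runs without a Solr server.

```python
import urllib.request

# Assumed endpoint: the update handler of a local example Solr instance.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(add_xml, url=SOLR_UPDATE_URL):
    """Wrap an <add> message in an HTTP POST request (not yet sent)."""
    return urllib.request.Request(
        url,
        data=add_xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


# The client supplies the URI as ordinary field data in the posted body.
request = build_update_request(
    '<add><doc><field name="id">1</field>'
    '<field name="uri">http://example.org/docs/1.xml</field></doc></add>'
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Solr sees only the request body, which is why any URI has to travel as a field inside it. A separate commit message would normally follow before the document becomes searchable.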



RE: storing the document URI in the index

Ard Schrijvers
In reply to this post by Yonik Seeley-2
Thanks Yonik and Walter,

Put that way, it does make good sense not to store the location of the XML file, since in most use cases that file is transient (I was thinking differently because I do have XML files on the file system or over HTTP, for example from a WebDAV call).

Anyway, thx for all answers, and again, sry for mails not indenting properly at the moment, it irritates me as well :-)

Regards Ard



RE: storing the document URI in the index

thorsten
On Tue, 2007-06-12 at 16:33 +0200, Ard Schrijvers wrote:
> Thanks Yonik and Walter,
>
> Put that way, it does make good sense not to store the location of the XML file, since in most use cases that file is transient (I was thinking differently because I do have XML files on the file system or over HTTP, for example from a WebDAV call).
>
> Anyway, thx for all answers, and again, sry for mails not indenting properly at the moment, it irritates me as well :-)
>
> Regards Ard

Hi Ard,

you may want to have a look at
http://wiki.apache.org/solr/SolrForrest

salu2
--
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java                      consulting, training and solutions