Index arbitrary XML elements in only one field without copying


thomas arni
Hello

I'm currently evaluating Solr for our needs. As a first step I took your
example and adapted the “schema.xml”.

Unlike the example docs provided, my documents are not homogeneous,
which means I only want to index two fields. These fields
are the uniqueKey (docno) and a text field (text).

<fields>
<field name="docno" type="string" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="true"/>
</fields>

Instead of using copyField to copy (and duplicate) other XML elements
into my “text” field, I want to specify which elements should be indexed
directly in the “text” field, without copying or duplicating them. I have
no need for additional index fields in my heterogeneous environment.
These extra fields only take up additional space in my index, which is a
disadvantage for me.


How can I specify arbitrary XML elements that should be indexed in my
one and only field, “text”? I have no need for additional fields in my
index.


Any help is appreciated.


Thomas

Re: Index arbitrary XML elements in only one field without copying

Erik Hatcher
Thomas - you will need to do this client-side if you don't want to  
use copyField.  The client needs to gather up all the text you want  
indexed and send that as <field name="text">....</field>

        Erik
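
A minimal Python sketch of the client-side approach described above, not code from the thread. The helper name `build_add_message`, the sample document, and the choice of `title` and `body` as the wanted elements are assumptions for illustration; only the `docno` and `text` field names come from the schema in the original question.

```python
import xml.etree.ElementTree as ET


def build_add_message(source_xml, id_tag, wanted_tags):
    """Flatten selected elements of one source document into a Solr <add> message."""
    root = ET.fromstring(source_xml)

    # Gather the text of every wanted element, in document order.
    # Note: if a wanted element is nested inside another wanted element,
    # its text is collected twice; this sketch does not guard against that.
    chunks = []
    for elem in root.iter():
        if elem.tag in wanted_tags:
            text = "".join(elem.itertext()).strip()
            if text:
                chunks.append(text)

    # Build <add><doc>...</doc></add> with exactly two fields.
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="docno").text = root.findtext(id_tag)
    ET.SubElement(doc, "field", name="text").text = " ".join(chunks)
    return ET.tostring(add, encoding="unicode")


# Hypothetical source document: only <title> and <body> should be searchable.
SAMPLE = (
    "<article><id>doc-1</id><title>Solr intro</title>"
    "<body>Indexing <b>made</b> simple.</body><price>12</price></article>"
)
message = build_add_message(SAMPLE, "id", {"title", "body"})
print(message)
```

The resulting string is what the client would POST to the update handler; unwanted elements such as `<price>` never reach Solr, so no extra fields and no copyField are involved.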


On Mar 14, 2007, at 3:50 AM, thomas arni wrote:

> How can I specify arbitrary XML elements that should be indexed in my
> one and only field, “text”? I have no need for additional fields in my
> index.

Re: Index arbitrary XML elements in only one field without copying

Burkamp, Christian
You can even put multiple <field name="text">....</field> entries into one document. The text field needs to be defined as multiValued for this to work:
<field name="text" type="text" indexed="true" stored="true" multiValued="true"/>
You can then put each chunk of data into its own "text" field entry.
Perhaps this approach is best suited to what you want to do?

--Christian
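
A short Python sketch of the multi-valued variant, assuming the `text` field is declared with multiValued="true" as shown above. The function name `doc_with_repeated_text_fields` and the sample chunks are hypothetical.

```python
from xml.sax.saxutils import escape


def doc_with_repeated_text_fields(docno, chunks):
    """Emit one <field name="text"> entry per chunk instead of one joined string."""
    lines = [
        "<add>",
        "  <doc>",
        f'    <field name="docno">{escape(docno)}</field>',
    ]
    for chunk in chunks:
        # escape() protects &, < and > so the message stays well-formed XML.
        lines.append(f'    <field name="text">{escape(chunk)}</field>')
    lines += ["  </doc>", "</add>"]
    return "\n".join(lines)


multi_message = doc_with_repeated_text_fields(
    "doc-1", ["Solr intro", "Tags & annotations"]
)
print(multi_message)
```

Compared with joining everything into one string on the client, this keeps each chunk as a separate value of the same field.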

-----Original Message-----
From: Erik Hatcher [mailto:[hidden email]]
Sent: Wednesday, March 14, 2007 11:55
To: [hidden email]
Subject: Re: Index arbitrary XML elements in only one field without copying


Thomas - you will need to do this client-side if you don't want to  
use copyField.  The client needs to gather up all the text you want  
indexed and send that as <field name="text">....</field>

        Erik



Re: Index arbitrary XML elements in only one field without copying

thomas arni
In reply to this post by Erik Hatcher
Thanks for your reply, Erik. I will use your suggested approach.

IMHO this could be something to add in future versions of Solr. The
Terrier IR framework, for example, and other IR solutions let you specify
different XML elements that should all be indexed in a single (Lucene) field.

As I said in my previous post, this approach is especially helpful if
you have heterogeneous documents with different XML elements.



Erik Hatcher wrote:

> Thomas - you will need to do this client-side if you don't want to use
> copyField.  The client needs to gather up all the text you want
> indexed and send that as <field name="text">....</field>
>
>     Erik

Restrict Servlet Access

Gunther, Andrew
What are people doing to restrict UpdateServlet access on production
installs of Solr? Are people removing that option and rotating in a new
index, or restricting access from the Jetty side?

Cheers,
Andrew



Re: Restrict Servlet Access

Erik Hatcher

On Mar 14, 2007, at 10:12 AM, Gunther, Andrew wrote:
> What are people doing to restrict UpdateServlet access on production
> installs of Solr.  Are people removing that option and rotating in a
> new index or restricting access from the jetty side.

The recommendation is to firewall off Solr so only your application  
server can access it.   Solr is not at all designed for direct client  
(browser, etc) access.

        Erik


Re: Restrict Servlet Access

Brian Whitman
>
> The recommendation is to firewall off Solr so only your application  
> server can access it.   Solr is not at all designed for direct  
> client (browser, etc) access.

Assuming you lock down update properly, what's the problem? We are  
currently using select directly through the XSLTResponseWriter right  
into a <div> via Ajax.Updater. Do you predict pain?





Re: Restrict Servlet Access

Erik Hatcher

On Mar 14, 2007, at 11:09 AM, Brian Whitman wrote:

>>
>> The recommendation is to firewall off Solr so only your  
>> application server can access it.   Solr is not at all designed  
>> for direct client (browser, etc) access.
>
> Assuming you lock down update properly, what's the problem? We are  
> currently using select directly through the XSLTResponseWriter  
> right into a <div> via Ajax.Updater. Do you predict pain?

I don't predict pain really, but I don't want to see Solr get bogged  
down in having a lot of security-related code added to it.  I do  
think it would be good for there to be some sort of capability to  
make Solr read-only in some form or another, such that an indexer  
could still work from an authorized environment.

Exposing Solr directly to a client does have appeal in the way you're  
doing it, but it also allows the possibility of hackers tinkering  
with it and perhaps requesting things they shouldn't.  For example,  
we index tags and annotations, and only a logged in user can see  
their own annotations, so exposing Solr directly would subvert that  
protection.

        Erik
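
A hedged Python sketch of how an application server sitting in front of Solr might keep per-user data private. Nothing here is from the thread: the `owner` field, the function name, and the `/select` path are assumptions, and it presumes a Solr version whose request handler supports the `fq` filter-query parameter. The point is that the user name comes from the server-side session, never from the browser.

```python
from urllib.parse import urlencode


def select_query_for_user(user_query, logged_in_user):
    """Build a select request that can only match the logged-in user's documents."""
    # Escape backslashes and quotes so the user name stays inside the phrase.
    safe_user = logged_in_user.replace("\\", "\\\\").replace('"', '\\"')
    params = [
        ("q", user_query),              # whatever the user typed
        ("fq", f'owner:"{safe_user}"'),  # hypothetical `owner` field, set by the server
    ]
    return "/select?" + urlencode(params)


print(select_query_for_user("tag:solr", "alice"))
```

If the browser talked to Solr directly, a client could simply omit or change the filter, which is the protection problem described above.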

Re: Restrict Servlet Access

Jed Reynolds-2
In reply to this post by Gunther, Andrew
Gunther, Andrew wrote:

>What are people doing to restrict UpdateServlet access on production
>installs of Solr.  Are people removing that option and rotating in a new
>index or restricting access from the jetty side.
>  
>

I'm putting Solr on my DMZ without direct WAN access. If I had to put it
on a WAN-facing server, I'd hide it behind Apache and reach it through
mod_rewrite with the [P] proxy flag. With mod_rewrite, as long as no rule
proxies the /foo/update URI, there is no external access to it.

Jed

Re: Index arbitrary XML elements in only one field without copying

Chris Hostetter-3
In reply to this post by thomas arni
:
: IMHO this could be something to add for future versions of solr. The
: Terrier IR-framework for example and other IR solutions allow to specify
: different XML-elements, which should be indexed in only one (lucene) field.

I don't know anything about Terrier, but there are lots of simple ways to
achieve things like this with Solr, depending on what exactly you want. Two
off the top of my head...

1) use an XSLT on the client to extract only the fields you want from your
XML file and build up the text fields you send to Solr (we have the
framework in place for you to even do that XSLT server side)

2) send each element that you care about as a separate field -- you could
use XPath-like descriptors for the names, i.e. ...
  <field name="//root/AAA/BBB/CCC">body of tag CCC</field>
...and then use copyField with a wildcard in the source to consolidate all
tags into a single text field...

   <dynamicField name="//*" type="text" indexed="true" stored="true" />
   <copyField source="//*"  dest="text" />





-Hoss
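
A small Python sketch of option 2 above: walk a source document and name each field after the path of the element it came from, so that a wildcard copyField such as the one shown can consolidate them. The function names and the `doc-1` key are hypothetical; the sample document mirrors the `//root/AAA/BBB/CCC` example from the post.

```python
import xml.etree.ElementTree as ET


def path_named_fields(source_xml):
    """Return (field_name, text) pairs, one per element that has its own text."""
    root = ET.fromstring(source_xml)
    fields = []

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        # Only the element's leading text is used; tail text is ignored here.
        text = (elem.text or "").strip()
        if text:
            fields.append((here, text))
        for child in elem:
            walk(child, here)

    walk(root, "/")  # "/" + "/root" gives the "//root/..." naming style
    return fields


def to_add_message(docno, fields):
    """Wrap the path-named fields in a Solr <add><doc> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="docno").text = docno
    for name, text in fields:
        ET.SubElement(doc, "field", name=name).text = text
    return ET.tostring(add, encoding="unicode")


SOURCE = "<root><AAA><BBB><CCC>body of tag CCC</CCC></BBB></AAA></root>"
pairs = path_named_fields(SOURCE)
path_message = to_add_message("doc-1", pairs)
print(path_message)
```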

RE: Restrict Servlet Access

Gunther, Andrew
In reply to this post by Erik Hatcher
I'm trying to get my head around the architecture where Solr sits behind a firewall. Can someone tease this out for me? Is a JNDI context establishing the connection to the app server? I'm naive about how one talks to the Solr servlet behind a firewall.

I apologize up front for the naivety.

-Andrew



-----Original Message-----
From: Erik Hatcher [mailto:[hidden email]]
Sent: Wednesday, March 14, 2007 11:18 AM
To: [hidden email]
Subject: Re: Restrict Servlet Access


On Mar 14, 2007, at 11:09 AM, Brian Whitman wrote:

>>
>> The recommendation is to firewall off Solr so only your  
>> application server can access it.   Solr is not at all designed  
>> for direct client (browser, etc) access.

Re: Restrict Servlet Access

Bess Sadler
Andrew, I don't know if this is what you're getting at, but my
solution is kind of naive but seems to work well. I have Solr running
on a given port, say :8983. I have my firewall (iptables) set up so
that the outside world cannot connect to :8983. However, my httpd
server, running on port 80, can connect to Solr because they are
running on the same box. Therefore all access to Solr is mediated
through whatever applications I choose to run through httpd. This is
the same approach we've always used for MySQL, and it has served us
well. When you start talking about JNDI it makes me think you're
thinking of a more sophisticated system, but it seems like the same
principles would apply.

Is that what you were asking about?

Bess
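
To make the setup above concrete, here is a minimal Python sketch of an application on the same box talking to Solr over plain HTTP on the firewalled port. No JNDI is involved. The `/solr` context path, the helper names, and the default `rows` value are assumptions; only port 8983 comes from the post.

```python
import urllib.request
from urllib.parse import urlencode

# Assumed base URL: Solr on the same machine, on the port the firewall hides.
SOLR_BASE = "http://localhost:8983/solr"


def select_url(query, rows=10):
    """Build the URL for a simple select request."""
    return f"{SOLR_BASE}/select?" + urlencode({"q": query, "rows": rows})


def search(query, rows=10, timeout=5):
    """Fetch the raw Solr response; only works from a host allowed to reach Solr."""
    with urllib.request.urlopen(select_url(query, rows), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(select_url("solr firewall"))
```

The web application calls something like `search(...)` and renders the result itself, so outside clients only ever see port 80.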

On Mar 15, 2007, at 12:26 PM, Gunther, Andrew wrote:

> I'm trying to get my head around the architecture where Solr sits
> behind a firewall.  Can someone tease this out for me?  Is a JNDI
> context establishing the connection to the app server?  I'm naive
> about how one talks to the Solr servlet behind a firewall.
>
> I apologize up front for the naivety.
>
> -Andrew





Re: Restrict Servlet Access

Chris Hostetter-3
: on a given port, say :8983. I have my firewall (iptables) set up so
: that the outside world cannot connect to :8983. However, my httpd
: server, running on port 80, can connect to solr because they are
: running on the same box. Therefore all access to solr is mediated
: through whatever applications I choose to run through httpd. This is
: the same approach we've always used for mySQL, and it has served us

Bingo.  This is definitely what I recommend.  (And I love the analogy of
MySQL: Solr is designed to be a "service" your applications use; the
communication protocol just happens to be HTTP.)

If you want to expose the raw Solr responses to the outside world, then
use something like mod_proxy on your webserver with configuration in place
to limit what kinds of requests the outside world can make on your Solr
server.



-Hoss
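
The kind of restriction a front-end proxy would enforce can be sketched in Python as a simple allow-list check. This is not mod_proxy configuration and not from the thread: the `/solr/select` path, the permitted parameter names, and the function name are all assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical allow-list: read-only select requests with a few basic parameters.
ALLOWED_PATHS = {"/solr/select"}
ALLOWED_PARAMS = {"q", "start", "rows", "fl"}


def is_request_allowed(method, url):
    """Return True only for GET requests to an allowed path with allowed parameters."""
    if method != "GET":
        return False
    parts = urlsplit(url)
    if parts.path not in ALLOWED_PATHS:
        return False  # blocks /solr/update and anything else not listed
    params = parse_qs(parts.query, keep_blank_values=True)
    return set(params) <= ALLOWED_PARAMS


print(is_request_allowed("GET", "/solr/select?q=solr&rows=5"))
print(is_request_allowed("GET", "/solr/update"))
```

Denying by default and listing only what is needed means new Solr handlers or parameters are not exposed by accident.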