merely a suggestion: schema.xml validator or better schema validation logging


Jed Reynolds-2
First time user. Not interested in flamewar, just making a suggestion.

I just got Solr working with my own schema and it was only a little more
mysterious than I expected, having previously dealt with Nutch. Solr is
exactly what I wanted in terms of (theoretical) ease of configurability.

However, my first try at defining a schema.xml file was tough because my
only feedback for a long time was "NullPointerException" from SolrCore
when I was trying to add content. I deduce that when SolrCore tried
invoking methods on the schema instance, the schema instance was
null.

 From a design point of view, this could easily be modeled with the
NullObject pattern, and an InvalidSchema object could be substituted as
a default schema object. Method invocations to that schema would
appropriately log why the proper schema failed to validate and instantiate.
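
A minimal sketch of that Null Object idea, written in Python rather than Solr's Java to keep it short. Every name here (`Schema`, `InvalidSchema`, `SchemaError`, `load_schema`, the dict-based config) is hypothetical and is not taken from the Solr code base; it only illustrates substituting an object that remembers why loading failed.

```python
class SchemaError(Exception):
    """Raised when a schema that failed to load is used."""


class Schema:
    """A successfully loaded schema (hypothetical, heavily simplified)."""

    def __init__(self, unique_key):
        self.unique_key = unique_key

    def get_unique_key(self):
        return self.unique_key


class InvalidSchema:
    """Null Object stand-in that remembers why loading failed."""

    def __init__(self, reason):
        self.reason = reason

    def get_unique_key(self):
        # Every call reports the original load failure instead of
        # failing later with an unrelated null reference error.
        raise SchemaError(f"schema is invalid: {self.reason}")


def load_schema(config):
    """Return a Schema, or an InvalidSchema describing the problem."""
    key = config.get("uniqueKey")
    if key not in config.get("fields", []):
        return InvalidSchema(f"uniqueKey '{key}' is not a defined field")
    return Schema(key)
```

With this shape, a caller that adds a document against a broken schema gets a message naming the misspelled key rather than a bare null reference error.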

I'd think that, since the capacity to define a schema via XML is so
attractively powerful, providing feedback on bad schemata would
really speed deployment and adoption.  It turned out that I had
misspelled the unique key field reference. Silly, but it can't be uncommon
for a first time user.

If there is already a method of pre-validating the schema, noting it on
the wiki would be really helpful.
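
As an illustration of what such a pre-validation step could look like, here is a small standalone check for the misspelled unique key case. It is only a sketch: it assumes the usual schema.xml layout with `<field name="..."/>` declarations and a top-level `<uniqueKey>` element, and the function name `check_unique_key` is made up for this example. It is not a tool that ships with Solr.

```python
import xml.etree.ElementTree as ET


def check_unique_key(schema_xml):
    """Return a list of problems found with the uniqueKey reference."""
    root = ET.fromstring(schema_xml)
    # Collect the names of all explicitly declared fields.
    declared = {f.get("name") for f in root.iter("field")}
    problems = []
    key = root.findtext("uniqueKey")
    if key is None:
        problems.append("no <uniqueKey> element found")
    elif key.strip() not in declared:
        problems.append(
            f"uniqueKey '{key.strip()}' does not match any declared field"
        )
    return problems
```

Running it against the schema text before deploying would report the typo up front; an empty list means this particular check passed, nothing more.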

So far, that has been my only hangup. This has been so much easier and
more appropriate than Nutch that I've been gung-ho all week setting this up. Thank
you!


Jed

Re: merely a suggestion: schema.xml validator or better schema validation logging

Bertrand Delacretaz
On 3/2/07, Jed Reynolds <[hidden email]> wrote:

> ...my first try at defining a schema.xml file was tough because my
> only feedback for a long time was "NullPointerException" from SolrCore
> when I was trying to add content...

Can you give us enough information to reproduce the problem? What was
wrong in your schema, exactly?

Please indicate also which version of Solr you used.

-Bertrand

Re: merely a suggestion: schema.xml validator or better schema validation logging

Yonik Seeley-2
In reply to this post by Jed Reynolds-2
Hi Jed,

NullPointerException when adding a document w/o the uniqueKey field is
a known bug, and should be fixed shortly.

If the actual schema was null, then that was probably some problem
parsing the schema.
If that's the case, hopefully you saw an exception in the logs on startup?

Anyway, I agree that some config errors could be handled in a more
user-friendly manner, and it would be nice if config failures could
make it to the front-page admin screen or something.

-Yonik



Re: merely a suggestion: schema.xml validator or better schema validation logging

Otis Gospodnetic-2
In reply to this post by Jed Reynolds-2
Hi,

Ah, a convenient thread - I was about to mention that I was able to mistakenly define multiple <tokenizer .../>'s inside a fieldType's analyzer without getting any kind of error.  The correct thing to do is to define 1 tokenizer followed by N* (token)filters.
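
A rough sketch of a check for that rule, in Python. It assumes analyzers are written as `<analyzer>` elements containing `<tokenizer>` and `<filter>` children; the function name `check_analyzers` is hypothetical, and analyzers with no tokenizer child (for example ones that name an analyzer class directly) are deliberately left alone.

```python
import xml.etree.ElementTree as ET


def check_analyzers(schema_xml):
    """Flag analyzers with several tokenizers or a filter before the tokenizer."""
    root = ET.fromstring(schema_xml)
    problems = []
    for analyzer in root.iter("analyzer"):
        tags = [child.tag for child in analyzer]
        count = tags.count("tokenizer")
        if count > 1:
            problems.append(f"analyzer has {count} tokenizers, expected 1")
        elif count == 1 and "filter" in tags[: tags.index("tokenizer")]:
            problems.append("tokenizer must come before any filter")
    return problems
```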

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share





Re: merely a suggestion: schema.xml validator or better schema validation logging

Jed Reynolds-2
In reply to this post by Yonik Seeley-2
Yonik Seeley wrote:

> If the actual schema was null, then that was probably some problem
> parsing the schema.
> If that's the case, hopefully you saw an exception in the logs on
> startup?


Using apache-solr-1.1.0-incubating.


Actually not at first, but now I do. I've gone back and re-created
the (or a similar) error, and the problem turned out to be the
way I was watching my logs. When I first started, I was just doing a
tail -F on catalina.out, but the exception was being written to the logfile
localhost.2007-03-01.log. Ah, tomcat my best old buddy old pal. I've
learned to just do a "tail -F *". I've obviously grown desensitized by
other java projects throwing exceptions to logs, and by so much logging
duplication between catalina.out and the tomcat contextual logs.

I almost didn't notice the exception fly by because there's soooo much
log output, and I can see why I might not have noticed. Yay for
scrollback! (Hrm, I might not have wanted to watch logging for 4
instances of solr all at once. Might explain why so much logging.)

Another helpful modification would be returning 500 error codes in the
header. This would help a script detect errors without needing to
grep or DOM-process the result element. The output of my php script to
load documents was showing me the snippet below. Possibly making the
error code configurable might help (I can see cases where forcing a 200
response is useful).



Array
(
    [errno] => 0
    [errstr] =>
    [response] => HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: text/xml;charset=UTF-8
Content-Length: 1329
Date: Sat, 03 Mar 2007 02:04:12 GMT
Connection: close

<result status="1">java.lang.NullPointerException
        at org.apache.solr.core.SolrCore.update(SolrCore.java:763)
        at
org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:53)
--snip--
</result>
)
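
Until the HTTP status reflects failures, a client has to look inside the body. The following is a sketch of that check in Python, based only on the response shape shown above. It assumes that `status="0"` on the `<result>` element means success, and the function name `update_failed` is invented for this example.

```python
import xml.etree.ElementTree as ET


def update_failed(body):
    """Return an error message if the update response reports failure."""
    root = ET.fromstring(body)
    status = root.get("status", "0")
    if root.tag == "result" and status != "0":
        # The element text carries the server-side message or stack trace.
        return (root.text or "").strip() or f"status {status}"
    return None
```

A loader script would call this on every response body, including those that arrive with "200 OK", and stop or log when the return value is not None.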





> Anyway, I agree that some config errors could be handled in a more
> user-friendly manner, and it would be nice if config failures could
> make it to the front-page admin screen or something.


That would be groovy!

I was able to see instances where a field was not defined. Now that I'm
looking at all the log files, I'm seeing the error I should have seen
earlier.

Thanks guys!

Jed

PS Last night I was able to index about 180,000 documents in about 2.5
hours. The resulting index is a bit over 800M. Compared to my
self-crawling with Nutch, this is 1/4 the time to index and 1/30th the
disk space used by indices. I am really impressed. I threw four
concurrent scripts making 50,000 distinct (select distinct tag from
taglist;) requests at this solr instance and my solr server was serving
50 requests per second per script and the solr server load average was
about 3.2. That's 200 requests per second against a 4 core box. The
tomcat instance was taking 606M ram, resident.



Re: merely a suggestion: schema.xml validator or better schema validation logging

Ryan McKinley
>
> I almost didn't notice the exception fly by because there's soooo much
> log output, and I can see why I might not have noticed. Yay for
> scrollback! (Hrm, I might not have wanted to watch logging for 4
> instances of solr all at once. Might explain why so much logging.)

This has bitten me more than once too!

The rationale with the solrconfig stuff is that a broken config should
behave as best it can.  This is great if you are running a real site
with people actively using it - it is a pain in the ass if you are
getting started and don't notice errors.

I'd like to see a "strict" configuration parameter.  If something
fails on startup, nothing would work until it was fixed.  If there is
any interest, I can put this together.

The other one that can confuse you is if you add documents with fields
that are undefined - rather than getting an error, solr adds the
fields that are defined (it may print out an exception somewhere, but
I've never noticed it).


>
> Another helpful modification would be returning 500 errors codes in the
> header. ...

The 'new' RequestHandler framework (apache-solr-1.2-dev) returns a
proper response code (400,500,etc).  It is not (yet) the default
handler for /select, but I hope it gets to be soon.

best
ryan

Re: merely a suggestion: schema.xml validator or better schema validation logging

Yonik Seeley-2
On 3/2/07, Ryan McKinley <[hidden email]> wrote:
> The rationale with the solrconfig stuff is that a broken config should
> behave as best it can.

I don't think that's what I was actually going for in this instance
(the schema).
I was focused on getting correct stuff to work correctly, and worry
about incorrect stuff later :-)

> The other one that can confuse you is if you add documents with fields
> that are undefined - rather then getting an error, solr adds the
> fields that are defined (it may print out an exception somewhere, but
> i've never noticed it)

Also unintended.

-Yonik

Re: merely a suggestion: schema.xml validator or better schema validation logging

Ryan McKinley
On 3/2/07, Yonik Seeley <[hidden email]> wrote:
> On 3/2/07, Ryan McKinley <[hidden email]> wrote:
> > The rationale with the solrconfig stuff is that a broken config should
> > behave as best it can.
>
> I don't think that's what I was actually going for in this instance
> (the schema).
> I was focused on getting correct stuff to work correctly, and worry
> about incorrect stuff later :-)
>

sorry, I was referring to solrconfig.xml... if something goes wrong
loading handlers it continues but prints out some log messages.  I
(think) there are code comments somewhere about how it should be ok to
have an error and still keep a working system...  I'd like to be able
to configure a "strict" mode so it does not continue.
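
To make the two behaviors concrete, here is a small Python sketch of a loader with a strict switch. It is not how Solr loads handlers; `load_handlers`, the `strict` flag, and the (name, factory) pairs are all hypothetical stand-ins for whatever the real configuration code does.

```python
import logging

log = logging.getLogger("config")


def load_handlers(definitions, strict=False):
    """Build handlers from (name, factory) pairs.

    In lenient mode a failing handler is logged and skipped; in strict
    mode the first failure aborts startup.
    """
    handlers = {}
    for name, factory in definitions:
        try:
            handlers[name] = factory()
        except Exception as exc:
            if strict:
                raise RuntimeError(f"handler '{name}' failed to load") from exc
            log.error("skipping handler '%s': %s", name, exc)
    return handlers
```

The lenient path keeps a running site up with whatever loaded correctly, while the strict path refuses to start, which is the behavior being asked for during setup.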


> > The other one that can confuse you is if you add documents with fields
> > that are undefined - rather then getting an error, solr adds the
> > fields that are defined (it may print out an exception somewhere, but
> > i've never noticed it)
>
> Also unintended.
>

How do you all feel about returning an error when you add a document
with unknown fields?

I spent a long time tracking down an error caused by sending documents with an
uppercase field name to a schema configured with a lowercase field name.

Re: merely a suggestion: schema.xml validator or better schema validation logging

Jed Reynolds-2
In reply to this post by Ryan McKinley
Ryan McKinley wrote:

>>
>> I almost didn't notice the exception fly by because there's soooo much
>> log output, and I can see why I might not have noticed. Yay for
>> scrollback! (Hrm, I might not have wanted to watch logging for 4
>> instances of solr all at once. Might explain why so much logging.)
>
> This has bitten me more then once too!
>
> The rationale with the solrconfig stuff is that a broken config should
> behave as best it can.  This is great if you are running a real site
> with people actively using it - it is a pain in the ass if you are
> getting started and don't notice errors.
>
> I'd like to see a "strict" configuration parameter.  If something
> fails on startup, nothing would work until it was fixed.  If there is
> any interest, I can put this together.

That would be helpful.

> The other one that can confuse you is if you add documents with fields
> that are undefined - rather then getting an error, solr adds the
> fields that are defined (it may print out an exception somewhere, but
> i've never noticed it)
>

I've read about this capability but I haven't experienced its effects yet.


>> Another helpful modification would be returning 500 errors codes in the
>> header. ...
>
> The 'new' RequestHandler framework (apache-solr-1.2-dev) returns a
> proper response code (400,500,etc).  It is not (yet) the default
> handler for /select, but I hope it gets to be soon.

Bitchen! Looking forward to that.


However, I've got a lot more learning and testing to do. Don't rush
anything on account of me.

Jed

Re: merely a suggestion: schema.xml validator or better schema validation logging

Jed Reynolds-2
In reply to this post by Ryan McKinley
Ryan McKinley wrote:

> On 3/2/07, Yonik Seeley <[hidden email]> wrote:
>> On 3/2/07, Ryan McKinley <[hidden email]> wrote:
>> > The rationale with the solrconfig stuff is that a broken config should
>> > behave as best it can.
>>
>> I don't think that's what I was actually going for in this instance
>> (the schema).
>> I was focused on getting correct stuff to work correctly, and worry
>> about incorrect stuff later :-)
>>
>
> sorry, I was referring to solrconfig.xml... if something goes wrong
> loading handlers it continues but prints out some log messages.  I
> (think) there are code comments somewhere about how it should be ok to
> have an error and still keep a working system...  I'd like to be able
> to configure a "strict" mode so it does not continue.
>
>
>> > The other one that can confuse you is if you add documents with fields
>> > that are undefined - rather then getting an error, solr adds the
>> > fields that are defined (it may print out an exception somewhere, but
>> > i've never noticed it)
>>
>> Also unintended.
>>
>
> How do you all feel about returning an error when you add a document
> with unknown fields?

That sounds like a good option to specify in solrconfig.xml.


> I spent a long time tracking down an error with a document set with an
> uppercase field name to something configured with a lowercase field.


Isn't this the kind of error that XML validation is supposed to address?
I completely understand the appeal of loosely validating XML documents,
of course. However, since adding a document to an index is not a
lightweight operation, adding validation doesn't seem unreasonable. If
writing a schema is required for validation, I'm willing to endure that
step. I can certainly see many instances when components in my system
written by other staff won't fit into my Solr schema. A way to enforce a
schema, strictly, in a dev environment, is entirely appropriate for me.


Jed

Re: merely a suggestion: schema.xml validator or better schema validation logging

Bertrand Delacretaz
In reply to this post by Ryan McKinley
On 3/3/07, Ryan McKinley <[hidden email]> wrote:

> ...The rationale with the solrconfig stuff is that a broken config should
> behave as best it can.  This is great if you are running a real site
> with people actively using it - it is a pain in the ass if you are
> getting started and don't notice errors....

I think it's a PITA in any case, I like my systems to fail loudly when
something's wrong in the configs (with details about what's happening,
of course).

-Bertrand

Re: merely a suggestion: schema.xml validator or better schema validation logging

Jed Reynolds-2
Bertrand Delacretaz wrote:

> On 3/3/07, Ryan McKinley <[hidden email]> wrote:
>
>> ...The rationale with the solrconfig stuff is that a broken config
>> should
>> behave as best it can.  This is great if you are running a real site
>> with people actively using it - it is a pain in the ass if you are
>> getting started and don't notice errors....
>
> I think it's a PITA in any case, I like my systems to fail loudly when
> something's wrong in the configs (with details about what's happening,
> of course).
>
> -Bertrand
>
I think it's interesting seeing the difference. The system at CNET
obviously needed to fail gracefully before it needed to fail fast. I
have the luxury of a dev environment and fail-fast is exactly the kinda
thing I want so I know about as many limitations and problems as soon as
possible.

Having this behavior toggleable would be ideal. Version the solrconfig.xml
between a fail-graceful for your production branch and a fail-fast for
your dev branch.

Jed

Re: merely a suggestion: schema.xml validator or better schema validation logging

Walter Underwood, Netflix
In reply to this post by Bertrand Delacretaz
I was bitten by this, too. It made getting started a lot harder.
I think I had something outside of an <lst> instead of inside.
More recently, I got a query time exception from a mis-formatted
<mm> field.

Right now, Solr accesses the DOM as needed (at runtime) to fetch
information. There isn't much up-front checking beyond the XML
parser.

wunder

On 3/3/07 12:50 AM, "Bertrand Delacretaz" <[hidden email]> wrote:

> On 3/3/07, Ryan McKinley <[hidden email]> wrote:
>
>> ...The rationale with the solrconfig stuff is that a broken config should
>> behave as best it can.  This is great if you are running a real site
>> with people actively using it - it is a pain in the ass if you are
>> getting started and don't notice errors....
>
> I think it's a PITA in any case, I like my systems to fail loudly when
> something's wrong in the configs (with details about what's happening,
> of course).
>
> -Bertrand


Re: merely a suggestion: schema.xml validator or better schema validation logging

Yonik Seeley-2
In reply to this post by Ryan McKinley
On 3/2/07, Ryan McKinley <[hidden email]> wrote:
> How do you all feel about returning an error when you add a document
> with unknown fields?

+1

dynamicField definitions can be used if desired (including "*" to
match every undefined field).

-Yonik

Re: merely a suggestion: schema.xml validator or better schema validation logging

Jed Reynolds-2
Yonik Seeley wrote:
> On 3/2/07, Ryan McKinley <[hidden email]> wrote:
>> How do you all feel about returning an error when you add a document
>> with unknown fields?
>
> +1
>
> dynamicField definitions can be used if desired (including "*" to
> match every undefined field).

If dynamicField definitions are removed from the schema.xml file (and
your fields are not referencing them), does this have the same effect as
disabling unknown-field generation?

Jed

Re: merely a suggestion: schema.xml validator or better schema validation logging

Yonik Seeley-2
On 3/3/07, Jed Reynolds <[hidden email]> wrote:
> If dynamicField definitions are removed from the schema.xml file (and
> your fields are not referencing them), does this have the same effect of
> disabling unknown-field generation?

Yes.  You should get an error if you add a document with a field that
doesn't match a defined field or a dynamic field.
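
The rule described here can be sketched in a few lines of Python. This is an illustration, not Solr's implementation: `resolve_field` is a made-up name, and it assumes a dynamic field pattern carries a single "*" at its start or end. Matching is case-sensitive, so an uppercase name does not match a lowercase field.

```python
def resolve_field(name, fields, dynamic_patterns):
    """Return the matching field or pattern, or raise for unknown names."""
    if name in fields:
        return name
    for pattern in dynamic_patterns:
        # Assumed: a pattern has a single '*' at its start or end.
        if pattern.startswith("*") and name.endswith(pattern[1:]):
            return pattern
        if pattern.endswith("*") and name.startswith(pattern[:-1]):
            return pattern
    raise ValueError(f"unknown field '{name}'")
```

With no dynamic patterns configured, any field name outside the declared set raises an error; adding a bare "*" pattern accepts everything.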

There still may be a bug that Ryan mentioned about unknown fields
simply being ignored, but that should be fixed if true.

-Yonik

Re: merely a suggestion: schema.xml validator or better schema validation logging

Chris Hostetter-3
In reply to this post by Jed Reynolds-2

: I almost didn't notice the exception fly by because there's soooo much
: log output, and I can see why I might not have noticed. Yay for
: scrollback! (Hrm, I might not have wanted to watch logging for 4
: instances of solr all at once. Might explain why so much logging.)

FYI: Solr logs a lot of stuff at the INFO and DEBUG levels, but "errors"
will always be at the SEVERE level (unless they aren't actually SEVERE and
are just exceptions encountered during trivial unimportant things, in which
case they are logged at the WARNING level). it's up to your servlet
container how verbose to be (ie: what level to log)

you should be able to configure it to put WARNING and SEVERE messages in a
separate log file even.

: > Anyway, I agree that some config errors could be handled in a more
: > user-friendly manner, and it would be nice if config failures could
: > make it to the front-page admin screen or something.
:
: That would be groovy!

i've been thinking that a Servlet that doesn't depend on any special Solr code
(so it will work even if SolrCore isn't initialized) but registers a log
handler and records the last N messages from Solr above a certain level
would be handy to refer people to when they are having issues and aren't
overly comfortable with log files.
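
The "last N messages above a certain level" part of that idea can be sketched with Python's standard logging module. This is an analogue only: Solr itself uses Java logging and a servlet would be needed to display the messages. The class name `RecentMessages` and the logger name `solr-sketch` are invented for the example.

```python
import collections
import logging


class RecentMessages(logging.Handler):
    """Keep the last `capacity` log records at or above `level`."""

    def __init__(self, capacity=50, level=logging.WARNING):
        super().__init__(level=level)
        # A bounded deque drops the oldest entry once it is full.
        self.records = collections.deque(maxlen=capacity)

    def emit(self, record):
        # Store the formatted text so it can be shown on a status page.
        self.records.append(self.format(record))


recent = RecentMessages(capacity=2)
logger = logging.getLogger("solr-sketch")
logger.setLevel(logging.DEBUG)
logger.addHandler(recent)

logger.info("ignored: below WARNING")
logger.warning("first")
logger.error("second")
logger.error("third")
print(list(recent.records))
```

Because the handler holds its own buffer, a status page could read `recent.records` even when the rest of the application failed to initialize.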

-Hoss


Re: merely a suggestion: schema.xml validator or better schema validation logging

Chris Hostetter-3
In reply to this post by Jed Reynolds-2

: > I spent a long time tracking down an error with a document set with an
: > uppercase field name to something configured with a lowercase field.

: Isn't this the kind of error that XML validation is supposed to address?

it could be ... except that:
  1) we can't use standard DTD/XSD style validation because we don't
know all the field names (not to mention dynamic fields)

  2) XML is just one of the transports for sending updates ... we expect
to support a lot more customizable formats in the near future.


-Hoss


Re: merely a suggestion: schema.xml validator or better schema validation logging

Chris Hostetter-3
In reply to this post by Walter Underwood, Netflix

: Right now, Solr accesses the DOM as needed (at runtime) to fetch
: information. There isn't much up-front checking beyond the XML
: parser.

bingo, and adding more upfront checking is hard for at least two reasons i
can think of...

1) keeping a DTD up to date is a pain as new features are added
2) the way some options are passed to pluggable classes makes it impossible
to validate (ie: tokenizers, caches, etc...)




-Hoss


Re: merely a suggestion: schema.xml validator or better schema validation logging

Ryan McKinley
In reply to this post by Yonik Seeley-2
>
> There still may be a bug that Ryan mentioned about unknown fields
> simply being ignored, but that should be fixed if true.
>

I just looked into this - /trunk code is fine.

I wasn't noticing the errors because the response code is always 200
with an error included in the xml.  My code was only checking for errors
on non-200 response codes.

Is there enough general interest in having error response codes to
change the standard web.xml config to let the SolrDispatchFilter
handle /select?

    <init-param>
      <param-name>handle-select</param-name>
      <param-value>true</param-value>
    </init-param>