Finding documents with undefined field

Fabio Confalonieri
Hello,
I would like to search for all documents in which a given field is not defined.
Say you have some documents with title_s defined and some without title_s: I would like to obtain all documents without title_s.

I haven't found a clue, so perhaps it's not possible; I would like to know if there is an alternative to setting a default value "undefined" for title_s on all documents.

Thank You

Fabio Confalonieri

Re: Finding documents with undefined field

Chris Hostetter-3
: I would like to search for all documents with a field not defined.
: Say you have documents with title_s defined and some documents without
: title_s: I would like to obtain all documents without title_s.
:
: I haven't found a clue, perhaps it's not possible; I would like to know if
: there is an alternative to setting a default value "undefined" for title_s
: on all documents.

This is a fairly common problem with Lucene -- selecting the inverse of a
query is hard, because you have to positively select something before you
can exclude things.

There are some tricks for doing this in Solr if you are writing your own
plugin using DocSets and BitSets, but if you're just using the
StandardRequestHandler then one thing you can do if you've got a field
that always contains at least one indexed value for every doc in your
index (like a uniqueKey field for example) is to use an unbounded range
query on your uniqueKey field to select every document, and then add a
prohibited clause on an unbounded range query for the field you want to
find missing values in, something like this...

   +uniqueKey:[* TO *] -title_s:[* TO *]

...normally in Lucene really big range queries are dangerous, but Solr
turns them into ConstantScoreRangeQueries under the covers for you so
there's no big penalty.
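
To make the "positively select something first" point concrete, here is a toy model built on plain java.util.BitSet rather than on Lucene's classes. The class name ToyBooleanQuery and its evaluate method are made up for illustration; the sketch only mimics the idea that prohibited clauses subtract from whatever the required clauses matched, and is not how Lucene is actually implemented.

```java
import java.util.BitSet;

// Toy model, not Lucene code: bit i set means "document i matches".
final class ToyBooleanQuery {
    // required may be null to model a query made only of prohibited clauses.
    static BitSet evaluate(BitSet required, BitSet prohibited) {
        BitSet result = new BitSet();
        if (required != null) {
            result.or(required);   // start from what was positively selected
        }
        result.andNot(prohibited); // prohibited clauses can only remove docs
        return result;
    }

    public static void main(String[] args) {
        BitSet allDocs = new BitSet();
        allDocs.set(0, 5);         // docs 0..4 all have a uniqueKey
        BitSet hasTitle = new BitSet();
        hasTitle.set(1);
        hasTitle.set(3);

        // Models "-title_s:[* TO *]" alone: nothing was selected, so nothing is left.
        System.out.println(evaluate(null, hasTitle));    // prints {}
        // Models "+uniqueKey:[* TO *] -title_s:[* TO *]".
        System.out.println(evaluate(allDocs, hasTitle)); // prints {0, 2, 4}
    }
}
```

The second call is the shape of the query above: select everything through the uniqueKey range, then remove the docs that have a title_s value.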


-Hoss


Re: Finding documents with undefined field

Fabio Confalonieri
Thank you Hoss (you are all always very responsive...),

actually I've developed my own FacetRequestHandler, extending the query format and adding a showfacet parameter (it's somewhat tailored to our needs, but I'd like to publish it when we have finished).
What I do merges some ideas from the forum; my query is now in three parts
  q=query;sort;filters
where filters is a list of query clauses separated by commas that I parse to get filterField and filterValue, then for every filter:

    filterList.add(QueryParsing.parseQuery(createQueryString("filterField:filterValue"), defaultField, req.getSchema()));

then I use filterList in the main query in

    DocListAndSet results = req.getSearcher().getDocListAndSet(query,filterList,sort,...

Then, if requested with the showfacets parameter, I get the facets by extracting and parsing a facetXML descriptor from a facet-type document in the index, querying for the facet descriptor of the current category that I get from the filter list (similar to CNET, I think).

To calculate counts for every facet composed of a field and a value, based on the main query, I use

    facetCount = searcher.numDocs(QueryParsing.parseQuery("facetField:facetValue", "", req.getSchema()), results.docSet);

Now, how could I get a filter for the missing field?
Can I use the unbounded range trick, simply adding a facet (and filter) like this:

    facetCount = searcher.numDocs(QueryParsing.parseQuery("-fieldName:[* TO *]", "", req.getSchema()), results.docSet);

...since I use results.docSet of the base query (the same for filters, I think)?
Or is there a better way?

Thank You again

   Fabio

Re: Finding documents with undefined field

Chris Hostetter-3

: actually I've developed my own FacetRequestHandler extending the query
: format and adding a showfacet parameter (it's a little custom on our needs,
: but I'd like to publish it when we have finished).

I'd love to see more custom request handlers ... it's always good to know
I'm not the only one out there writing them :)

: Then, if requested with showfacets parameter, I get facets extracting and
: parsing a facetXML descriptor from a facet-type document in the index,
: querying for the facet descriptor of the current category i get from the
: filter list (similar to CNET, i think).

yeah, it certainly sounds like it.

: Now, how could I get a filter for the missing field ?
: Can I use the unbounded range trick simply adding a facet (and filter) like
: this:
:
:     facetCount = searcher.numDocs(QueryParsing.parseQuery("-fieldName:[* TO
: *]", "", req.getSchema()), results.docSet);

I'm pretty sure that won't work as is ... you'll run into the same problem I
was talking about before: your query doesn't positively select anything
(the sole negated clause just rejects things)

There are a couple of things you can do here...

1) Use the same approach I described before if you have a uniqueKey:
search for all things with a key and then exclude things that have a value
in your field.  Since you are writing a request handler, you could also
programmatically build up a BooleanQuery containing a MatchAllDocsQuery
object and your prohibited clause even if you don't have a uniqueKey

2) you can fetch the DocSet of all documents that *do* have a value for
that field, and then get the inverse, and use that for your facet counts.
this is something that was discussed before in a thread Erik started...

http://www.nabble.com/request-handler-and-caches-t1593321.html#a4343055

Getting the inverse of a DocSet is currently not a built in operation, you
have to use the getBits() method and operate on it, something like this
should work...

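  DocSet definedSet = search.getDocSet(parseQuery("field:[* TO *]"));
  // BitSet.flip() returns void and works in place, so flip a copy of the bits
  BitSet bits = (BitSet) definedSet.getBits().clone();
  bits.flip(0, search.maxDoc());
  DocSet unDefinedSet = new BitDocSet(bits);
  int count = unDefinedSet.intersectionCount(results.docSet);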

...at least, i think it should work .. i've never really had to worry
about inverted sets.


-Hoss


Re: Finding documents with undefined field

Erik Hatcher

On Jun 7, 2006, at 3:43 PM, Chris Hostetter wrote:

> Getting the inverse of a DocSet is currently not a built in  
> operation, you
> have to use the getBits() method and operate on it, something like  
> this
> should work...
>
>   DocSet definedSet = search.getDocSet(parseQuery("field:[* TO *]"));
>   DocSet unDefinedSet = new BitDocSet(fieldDefinedSet.getBits().flip
> (0,search.maxDoc()))
>   int count = unDefinedSet.intersectionCount(results.docSet)
>
> ...at least, i think it should work .. i've never really had to worry
> about inverted sets.

Here's how I build "inverse" BitSets that represent documents that do  
not have a value in a facet field:

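       // BitSet grows on demand, but doc ids run up to maxDoc() - 1,
       // so maxDoc() is the better initial size
       BitSet catchall = new BitSet(reader.maxDoc());

       TermDocs termDocs = reader.termDocs();
       TermEnum termEnum = reader.terms(new Term(field, ""));
       while (true) {
         Term term = termEnum.term();
         // stop once the enumeration runs out or moves on to another field
         if (term == null || !term.field().equals(field)) break;

         // collect the docs containing this term
         termDocs.seek(term);
         BitSet bitSet = new BitSet(reader.maxDoc());
         while (termDocs.next()) {
           bitSet.set(termDocs.doc());
         }

         catchall.or(bitSet);

         // ... cache bitSet ...

         if (! termEnum.next()) break;
       }

       // catchall now marks every doc that has a value in the field;
       // the docs without a value are the ones whose bit is still clear

       // ... cache catchall ...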

       // ... cache catchall ...

Solr's DocSets are a better way to go in the long run, I'm convinced  
- I'm just now starting to leverage them in other ways.  I do still  
need to do these kinds of inverted sets somehow.

        Erik


Re: Finding documents with undefined field

Yonik Seeley
On 6/7/06, Erik Hatcher <[hidden email]> wrote:
> Solr's DocSets are a better way to go in the long run, I'm convinced
> - I'm just now starting to leverage them in other ways.

Some random performance numbers... when I enabled HashDocSet support,
performance of CNET shoppers faceted browsing requests increased by
3.6 times.  Another faceted browsing system could not have been built
at all due to the huge number of facets (most very small, so
HashDocSet allowed them to all fit).

>  I do still
> need to do these kinds of inverted sets somehow.

One problem is that not() needs to know how large the sets are.  I
could add a DocSet.flip(int maxDoc) or a DocSet.flip(int startIndex,
int endIndex) or something like that... but a user would need to know
what maxDoc is...

DocSet.andNot(DocSet other) would be doable w/o knowledge of maxDoc though.
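
A rough illustration of why andNot() needs no maxDoc, using java.util.BitSet as a stand-in for a DocSet. The class AndNotSketch is hypothetical and is not the proposed Solr method; it only shows that subtracting one set from another never touches bits outside the base set, so the index size does not matter.

```java
import java.util.BitSet;

// Stand-in sketch with java.util.BitSet; not the DocSet.andNot() proposed above.
final class AndNotSketch {
    // Returns the docs in base that are not in other, leaving both inputs untouched.
    static BitSet andNot(BitSet base, BitSet other) {
        BitSet result = (BitSet) base.clone();
        result.andNot(other); // clears only bits already present in base
        return result;
    }

    public static void main(String[] args) {
        BitSet results = new BitSet();       // docs matched by the main query
        results.set(2);
        results.set(3);
        results.set(5);
        results.set(8);

        BitSet hasField = new BitSet();      // docs with a value in the field
        hasField.set(3);
        hasField.set(8);
        hasField.set(9);

        BitSet missing = andNot(results, hasField);
        System.out.println(missing);               // prints {2, 5}
        System.out.println(missing.cardinality()); // prints 2
    }
}
```

For the facet-count case, the cardinality of the result is the number of matching docs that lack the field.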

-Yonik

Re: Finding documents with undefined field

Erik Hatcher

On Jun 7, 2006, at 4:18 PM, Yonik Seeley wrote:

>>  I do still
>> need to do these kinds of inverted sets somehow.
>
> One problem is that not() needs to know how large the sets are.  I
> could add a DocSet.flip(int maxDoc) or a DocSet.flip(int startIndex,
> int endIndex) or something like that... but a user would need to know
> what maxDoc is...
>
> DocSet.andNot(DocSet other) would be doable w/o knowledge of maxDoc  
> though.

The code for building these DocSets would be in the cache warming
phase of Solr, right where the IndexReader is readily available.  I
don't see a problem with requiring size or range parameters from a
client code perspective.

        Erik


Re: Finding documents with undefined field

Chris Hostetter-3
In reply to this post by Yonik Seeley

: One problem is that not() needs to know how large the sets are.  I
: could add a DocSet.flip(int maxDoc) or a DocSet.flip(int startIndex,
: int endIndex) or something like that... but a user would need to know
: what maxDoc is...

alternately, we could make maxDoc an intrinsic and immutable part of the
DocSet API ... as an "int getMaxSize()" method in the interface perhaps.
The various constructors for the implementing classes could require it as a
constructor argument (unless they are being constructed from an existing
impl, in which case they could just ask)

in the context of a BitSet it makes a lot of sense to let the "max size"
of the BitSet grow as needed ... but in the context of a Lucene index the
number of docs that can ever possibly be in a DocSet can never change (if
it does, the DocSet is invalid because ids may have moved)
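
A minimal sketch of what that could look like, assuming a set backed by java.util.BitSet. SizedDocSet is a hypothetical class, not part of Solr; only the getMaxSize() name comes from the suggestion above, and size(), exists() and flip() are illustrative choices.

```java
import java.util.BitSet;

// Hypothetical class, not part of Solr: a doc set that remembers its maxDoc.
final class SizedDocSet {
    private final BitSet bits;
    private final int maxSize; // fixed at construction, never changes

    SizedDocSet(BitSet bits, int maxSize) {
        // BitSet.length() is the highest set bit + 1, so this rejects out-of-range ids
        if (bits.length() > maxSize) {
            throw new IllegalArgumentException("doc id beyond maxSize");
        }
        this.bits = (BitSet) bits.clone();
        this.maxSize = maxSize;
    }

    int getMaxSize() { return maxSize; }

    int size() { return bits.cardinality(); }

    boolean exists(int doc) { return bits.get(doc); }

    // The inverse needs no argument: the new set just asks this one for maxSize.
    SizedDocSet flip() {
        BitSet flipped = (BitSet) bits.clone();
        flipped.flip(0, maxSize);
        return new SizedDocSet(flipped, maxSize);
    }
}
```

With maxSize carried by the set, a caller inverting it no longer has to know the index's maxDoc, which was the objection to a flip(int maxDoc) method.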

: DocSet.andNot(DocSet other) would be doable w/o knowledge of maxDoc though.

yeah ... and getting a DocSet that matches all docs in the index would be
easy enough using a MatchAllDocsQuery.


-Hoss


Re: Finding documents with undefined field

Fabio Confalonieri
In reply to this post by Chris Hostetter-3
Chris Hostetter wrote
There are a couple of things you can do here...

1) Use the same approach i described before if you have a uniqueKey,
search for all things with a key and then exclude things that have a value
in your field.  Since you are writing a request handler, you could also
programmatically build up a BooleanQuery containing a MatchAllDocsQuery
object and your prohibited clause even if you don't have a uniqueKey

2) you can fetch the DocSet of all documents that *do* have a value for
that field, and then get the inverse, and use that for your facet counts.
this is something that was discussed before in a thread Erik started...
..
OK, in the end I tried the easy way: when I find a particular predefined
"undefined-value" in a filter or facet, I convert the query to parse to:

   "type:ad AND -" +field+":[* TO *]"

"type:ad" matches all my documents, the other type I have is "facets"
 (many thanks for the unbounded range trick).
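
The conversion described above might look roughly like the following. This is only a sketch: the class and method names are invented, and the sentinel string is an assumption, since the post does not say which "undefined-value" is actually used.

```java
// Illustrative helper; names and the sentinel value are assumptions.
final class FilterQueries {
    // Assumed sentinel meaning "field has no value".
    static final String UNDEFINED = "__undefined__";

    // Builds the query string to hand to the query parser for one filter or facet.
    static String toQueryString(String field, String value) {
        if (UNDEFINED.equals(value)) {
            // type:ad positively selects every ad document, then the
            // prohibited range clause drops those that have a value
            return "type:ad AND -" + field + ":[* TO *]";
        }
        return field + ":" + value;
    }

    public static void main(String[] args) {
        System.out.println(toQueryString("title_s", UNDEFINED)); // prints type:ad AND -title_s:[* TO *]
        System.out.println(toQueryString("color_s", "red"));     // prints color_s:red
    }
}
```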

I cannot see any particular slowness (but I'm testing with 50,000 docs
now), perhaps thanks to Solr's ConstantScoreRangeQuery conversion;
should I worry with bigger numbers, say 300,000 docs?

My two cents on Solr development: surely "DocSet.andNot(DocSet other)"
capability would be precious to optimize the undefined-field and other
inverse-query problems.

Thanks again

    Fabio

Re: Finding documents with undefined field

Yonik Seeley
On 6/8/06, Fabio Confalonieri <[hidden email]> wrote:

> Ok at last I tried the easy way so, when I find a particular predefined
> "undefined-value" in a filter or facet, I convert the query to parse to:
>
>    "type:ad AND -" +field+":[* TO *]"
>
> "type:ad" matches all my documents, the other type I have is "facets"
>  (many thanks for the unbound range trick).
>
> I cannot see any particular slowliness (but I'm testing with 50.000 docs
> now) perhaps thanks to Solr ConstantScoreRangeQueries conversion,
> should I worry with bigger numbers, say 300.000 docs ?

Provided you have the memory for the number of facets you are using,
the filterCache should handle any slowness problem.

There are optimizations that could be done to speed up getting the
DocSets (filters) for simple queries, but it hasn't been a priority
given that we operate off the filter cache so much.

-Yonik