PathHierarchyTokenizerFactory single level match

lstusr 5u93n4
Hi,

I have a schema that has a descendent_path field type, configured as in the
PathHierarchyTokenizerFactory docs:

 <fieldType name="descendent_path" class="solr.TextField">
   <analyzer type="index">
     <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" />
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.KeywordTokenizerFactory" />
   </analyzer>
 </fieldType>


Using the example in the docs:  *For example, in the configuration below a
query for Books/NonFic will match documents indexed with values like
Books/NonFic, Books/NonFic/Law, Books/NonFic/Science/Physics, etc. But it
will not match documents indexed with values like Books, or Books/Fic.* This
works great and solves a primary use case.
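
To make the matching behaviour concrete, here is a minimal Python sketch of what the index-time tokenizer is assumed to emit for a path, with `/` as the delimiter and no other options set. It is an illustration only, not Solr's implementation; `path_hierarchy_tokens` and `matches` are hypothetical helper names.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Emit every ancestor prefix of `path`, shortest first.

    Assumption: this mirrors the tokens the index analyzer produces
    for a descendent_path field with the given delimiter.
    """
    parts = path.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


def matches(query, indexed_value):
    """The query analyzer keeps the query as one token (KeywordTokenizer),
    so a document matches when that token equals one of its indexed prefixes."""
    return query in path_hierarchy_tokens(indexed_value)


print(path_hierarchy_tokens("Books/NonFic/Science/Physics"))
print(matches("Books/NonFic", "Books/NonFic/Law"))  # a descendant
print(matches("Books/NonFic", "Books/Fic"))         # a sibling branch
```

Because every ancestor prefix is indexed, a query for a parent path matches documents at any depth below it, which is exactly why the single-level case needs something extra.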

However, we have a secondary use case where we need to get all documents
at a single level. For example, let's say I want only the categories
directly under Books/NonFic/, like Books/NonFic/Science, Books/NonFic/Art,
Books/NonFic/Math, etc. I can query for Books/NonFic, but this gives me
all of the deeper descendants too. One solution is to query for:

category:Books/NonFic/* -category:Books/NonFic/*/*

which seems like it works, but feels a little clunky.

The other solution I can think of is to add a separate, non-tokenized field
to each document at index time, something like parentCategory, indexed but
not stored, holding Books/NonFic for each of the Books/NonFic/[Science, Art,
Math] documents. However, with this solution I'm duplicating the information
and increasing my index size. This is not the worst thing, I know, but the
field is by far the largest contributor to the index size already, and
doubling the information there will have a noticeable impact on the disk
footprint.
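
As a rough sketch of the parentCategory idea, the value could be derived on the client before the document is sent for indexing. The helper name `parent_category` and the in-memory list standing in for the index are assumptions for illustration, not anything Solr provides.

```python
def parent_category(path, delimiter="/"):
    """Return everything before the last delimiter, or "" for a top-level value."""
    parent, _, _ = path.rpartition(delimiter)
    return parent


# Hypothetical documents; in practice these would be sent to the index.
docs = [
    {"category": c, "parentCategory": parent_category(c)}
    for c in [
        "Books/NonFic/Science",
        "Books/NonFic/Art",
        "Books/NonFic/Science/Physics",
    ]
]

# An exact match on parentCategory selects only the direct children.
direct_children = [d["category"] for d in docs if d["parentCategory"] == "Books/NonFic"]
print(direct_children)
```

The cost is the one described above: every document carries a second string nearly as long as the path itself.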

So my question: with a projected index size in the billions of documents,
would you take either one of those two approaches? Or a third that I
haven't thought of?

Thanks,

Kyle

Re: PathHierarchyTokenizerFactory single level match

Erick Erickson
A couple of things.

bq. the field is by far the largest contributor to the index size already,

That's a rather odd statement. It implies that there's very little
else in your documents. If you have any descriptions etc. I'd think
that the category info wouldn't be all that huge in comparison. How
are you measuring?

One alternative would be to index an extra field with just the
_number_ of levels, so Books/NonFic/Science would have a second field
"level_count" set to 3. The direct children of Books/NonFic are the
documents with exactly three levels, so your secondary search becomes
"q=whatever&fq=category:Books/NonFic&fq=level_count:3".

Best,
Erick
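
The level_count suggestion above can be sketched in Python as follows. This assumes level_count is simply the number of non-empty path segments, computed on the client at index time; `level_count` and `single_level_params` are hypothetical helpers, and `*:*` stands in for whatever the main query is.

```python
from urllib.parse import urlencode


def level_count(path, delimiter="/"):
    """Number of non-empty segments, e.g. Books/NonFic/Science has 3."""
    return len([p for p in path.split(delimiter) if p])


def single_level_params(parent, q="*:*", delimiter="/"):
    """Request parameters that select only the direct children of `parent`.

    Direct children sit exactly one level below the parent.
    Note: depending on the query parser, the '/' in the path may need
    escaping; that is not handled in this sketch.
    """
    return [
        ("q", q),
        ("fq", f"category:{parent}"),
        ("fq", f"level_count:{level_count(parent, delimiter) + 1}"),
    ]


params = single_level_params("Books/NonFic")
print(params)
print(urlencode(params))  # URL-encoded form for the request
```

An integer per document is far cheaper than a second copy of the path, which is the main appeal over a parentCategory-style field.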
In reply to lstusr 5u93n4's message of Fri, Nov 23, 2018 at 6:24 AM.

Re: PathHierarchyTokenizerFactory single level match

lstusr 5u93n4
Lots of discussion about XY problems on this list lately..... Maybe I'm a
bit guilty. :D

I used the example from the docs to be clear, but our real use case is
indexing file metadata on a large filesystem. With a few fields like owner,
group, mode, lastmodified, filesize, type, and path, the path field is the
only non-numeric, non-date field that can exceed a couple of characters. So
we want to be able to say: give me all of the directories in a particular
parent, and get the answer without the children.

Using the level_count is a great idea. I think this is the way we'll go
here.
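
For the filesystem case, a minimal sketch of the same idea might look like the following. The example paths, the `directory` and `file` type values, and the helper names are all hypothetical, and the in-memory filter only imitates what the two filter queries would do on the index.

```python
def depth(path):
    """Number of non-empty segments in a '/'-delimited path."""
    return len([s for s in path.split("/") if s])


def file_doc(path, file_type):
    """Build a document carrying the path plus its precomputed level_count."""
    return {"path": path, "type": file_type, "level_count": depth(path)}


def directories_in(parent, docs):
    """Directories exactly one level below `parent`, without deeper descendants."""
    parent = parent.rstrip("/")
    wanted = depth(parent) + 1
    return [
        d["path"]
        for d in docs
        if d["type"] == "directory"
        and d["path"].startswith(parent + "/")
        and d["level_count"] == wanted
    ]


docs = [
    file_doc("/data/a", "directory"),
    file_doc("/data/a/b", "directory"),
    file_doc("/data/c", "directory"),
    file_doc("/data/f.txt", "file"),
    file_doc("/other/x", "directory"),
]
print(directories_in("/data", docs))
```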

Thanks for your help!

Kyle

In reply to Erick Erickson's message of Fri, 23 Nov 2018 at 14:18.

Re: PathHierarchyTokenizerFactory single level match

Erick Erickson
Ah, I see. And I doubt there's any position information or vector
information for that field so it's probably as small as it could be
anyway.

One note about stored data, assuming you've set stored="true". It's
all kept in the ".fdt" and ".fdx" segment files and doesn't have much
effect on the memory requirements for searching. It's only accessed to
return the top N documents, so while the search may look at a zillion
docs, the stored data will only be accessed for, say, the 10 documents
returned. True, it does occupy a lot of disk space.

Good luck!
In reply to lstusr 5u93n4's message of Fri, Nov 23, 2018 at 11:44 AM.