solr tuple/tag store


solr tuple/tag store

Ryan McKinley
Hello-

I am running into some scaling performance problems with SQL that I hope
a clever solr solution could fix.  I've already gone through a bunch of
loops, so I figure I should solicit advice before continuing to chase my
tail.

I have a bunch of things (100K-500K+) that are defined by a set of user
tags.  ryan says: (name=xxx, location=yyy, foo=[aaa,bbb,ccc]), and
alison says: (name=zzz, location=bbb) - this list is constantly
updated; it is fed by automated crawlers and user-generated content.
The 'names' can be arbitrary, but 99% of them will come from ~25
distinct names.

My approach has been to build a repository of all the 'tags' and then as
things come into that repository, I merge all the tags for that entry
into a single 'flat' document and index it with solr.

When my thing+tag count was small, a simple SQL table with a row for
each tag worked great:

CREATE TABLE `my_tags` (
  entryID varchar(40) NOT NULL,
  source  varchar(40) NOT NULL,
  name    varchar(40) NOT NULL,
  value   TEXT NOT NULL,
  KEY( entryID ),
  KEY( source )
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

but as the row count gets big (2M+) this gets to be unusable.  To make
it tractable, I am now splitting the tags across a bunch of tables and
pushing the per-user name/value pairs into a single text field (stored
as JSON):

CREATE TABLE `my_tags_000` (
  entryID varchar(40) NOT NULL,
  source varchar(40) NOT NULL,
  tags LONGTEXT NOT NULL,
  PRIMARY KEY( entryID, source )
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Then I pick which table an entry goes into using:
  Math.abs( id.hashCode() ) % 10
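
One caveat with that expression: Math.abs(Integer.MIN_VALUE) is still
negative, so the modulo can come out negative for one unlucky hash
value.  A sketch of the routing as a floor-mod helper (hypothetical
class, names made up):

// Maps an entry id to one of the my_tags_NNN tables.
public class TagShardRouter {
  private static final int NUM_TABLES = 10;

  public static String tableFor(String entryId) {
    // floor-mod: always in [0, NUM_TABLES), even for Integer.MIN_VALUE
    int bucket = (entryId.hashCode() % NUM_TABLES + NUM_TABLES) % NUM_TABLES;
    return String.format("my_tags_%03d", bucket);  // e.g. "my_tags_003"
  }
}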

This works OK, but it is still slower than I would like.  DB access is
slow, and it also needs to search across the updating solr index, which
gets slow since it keeps reopening the searcher (autowarming is off!)

Soooo...  I see a few paths and would love external feedback before
banging my head on this longer.

1. Get help from someone who knows more SQL than I do and try to make a
pure SQL approach work.  This would need to work with 10M+ tags.  Solr
indexing is then a direct SQL -> solr dump.

2. Figure out how to keep the base Tuple store in solr.  I think this
will require finishing up SOLR-139.  This would keep the core data
in solr - so there is no good way to 'rebuild' the index.

3. something else?  store input on disk?


Any thoughts / pointers / nay-saying would be really helpful!

thanks
ryan

RE: solr tuple/tag store

Lance Norskog-2
You did not give your queries. I assume that you are searching against the
'entryID' and updating the tag list.

MySQL has a "fulltext" index. I assume this is a KWIC index but do not know.
A "fulltext" index on "entryID" should be very very fast since single-record
results are what Lucene does best.

Lance


Re: solr tuple/tag store

Erik Hatcher
In reply to this post by Ryan McKinley

On Oct 9, 2007, at 3:14 PM, Ryan McKinley wrote:
> 2. Figure out how to keep the base Tuple store in solr.  I think  
> this will require finishing up SOLR-139.  This would keep the
> core data in solr - so there is no good way to 'rebuild' the index.

With SOLR-139, cool stuff can actually be done to 'rebuild' an index.
Obviously, if your store is Solr, you'll be using stored fields.  So
store the most basic stuff, and copyField things around.  To rebuild an
index, you simply reconfigure the copyField settings and basically
`touch` each document to reindex it.

I did this with Collex recently as I refactored all of my old Collex  
tag architecture into SOLR-139.   My tag design is nowhere near as  
scalable as the one you're after, I don't think.  Yonik has some  
pretty prescient design ideas here:

        <http://wiki.apache.org/solr/UserTagDesign>

Particularly interesting are the parts about leveraging intra-Lucene
field-matching capability (the Phrase/SpanQuery possibilities are
pretty neat) to reduce the number of fields.
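
For instance, assuming "erik#lucene" is instead indexed as two adjacent
tokens ("erik", "lucene") in a single field, a sketch against the
Lucene 2.x query API (the 'utag' field name is just for illustration):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// Sketch: match docs where user "erik" applied tag "lucene", relying
// on the two tokens being adjacent in the same field (default slop 0).
public class UserTagQueries {
  public static PhraseQuery userTag(String user, String tag) {
    PhraseQuery q = new PhraseQuery();
    q.add(new Term("utag", user));
    q.add(new Term("utag", tag));
    return q;
  }
}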

> 3. something else?  store input on disk?

   *gasp*  Inconceivable!  :)

        Erik



Re: solr tuple/tag store

Pieter Berkel
In reply to this post by Lance Norskog-2
Given that the tables are of type InnoDB, I think it's safe to assume
that you're not planning to use MySQL full-text search (it is only
supported on MyISAM tables).  If you are not concerned about the
transactional integrity provided by InnoDB, perhaps you could try using
MyISAM tables (although most people report speed improvements for
insert operations on relatively small data sets, rather than for
selects).

Without seeing the actual queries that are slow, it's difficult to determine
what the problem is.  Have you tried using EXPLAIN (
http://dev.mysql.com/doc/refman/5.0/en/explain.html) to check if your query
is using the table indexes effectively?
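
For example, a hypothetical check against the my_tags table from
earlier in the thread:

-- does the lookup by entryID actually use the KEY( entryID ) index?
EXPLAIN SELECT name, value
FROM my_tags
WHERE entryID = 'abc123';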

Pieter


Re: solr tuple/tag store

Lance Norskog-2
You could just make a separate Lucene index: your schema would have
entryID as the unique field and multiple tag values per entryID.
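
A minimal sketch of that, assuming the Lucene 2.x API of the day (the
class, field, and method names here are made up):

import java.io.IOException;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// One Lucene doc per entry: entryID is unique, tags are multi-valued.
public class TagIndexer {
  public static void writeEntry(IndexWriter writer, String entryId,
                                List<String> tags) throws IOException {
    Document doc = new Document();
    doc.add(new Field("entryID", entryId,
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    for (String tag : tags) {
      doc.add(new Field("tag", tag,
                        Field.Store.YES, Field.Index.TOKENIZED));
    }
    // replaces any existing doc with the same entryID (Lucene >= 2.1)
    writer.updateDocument(new Term("entryID", entryId), doc);
  }
}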

I just made a phrase-suggesting clone of the Spellchecker class that
does almost exactly the same thing: it indexes multiple second words
for each single first word.  It was my first Lucene project and was
very easy to code.

Lance


Re: solr tuple/tag store

Ryan McKinley
In reply to this post by Pieter Berkel
Pieter Berkel wrote:
> Given that the tables are of type InnoDB, I think it's safe to assume that
> you're not planning to use MySQL full-text search (only supported on MyISAM
> tables).  

I am only using SQL for the base store - it is only accessed for
updating and generating solr documents.  All search and public access
is done via solr.

MyISAM?!  I have been bitten by corrupt MyISAM tables one too many times ;)


>
> Without seeing the actual queries that are slow, it's difficult to determine
> what the problem is.  Have you tried using EXPLAIN (
> http://dev.mysql.com/doc/refman/5.0/en/explain.html) to check if your query
> is using the table indexes effectively?
>

Yes, the issue is the number of rows: with 10M rows, select(*) can
take > 1 min.  At that size it was actually faster to remove the index,
so that it was forced to do a single iteration through all docs rather
than use the index (I don't fully understand why).

EXPLAIN says it is a simple query using the primary key, but it can
still take > 30 sec to complete!

In general it seems like a bad idea to have MySQL tables with lots of
rows...  that is why I'm leaning towards a solr solution.

Re: solr tuple/tag store

Ryan McKinley
In reply to this post by Erik Hatcher

> the most basic stuff, and copyField things around.  With SOLR-139, to
> rebuild an index you simply reconfigure the copyField settings and
> basically `touch` each document to reindex it.
>

had not thought of that... yes, that would work

> Yonik has some pretty prescient design ideas here:
>
>     <http://wiki.apache.org/solr/UserTagDesign>
>

Yonik is quite clever!  This does not even involve bit operations.

In the example:
add to A10, field utag="erik#lucene"   // or "erik lucene", single token
add to A10, field user="erik"  // via copyField
add to A10, field tag="lucene" // via copyField

I take it 'user' needs a fieldType that would only keep the first part
of what is passed in, and 'tag' would be a different type (or params)
that only keeps the latter.
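
Something like this filter could do it - a sketch against the old
Lucene 2.x TokenStream API (the class name and the '#' delimiter
convention are assumptions):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Keeps only the user part of a "user#tag" token; a mirror filter
// keeping the substring after '#' would yield the tag part.
public class UserPartFilter extends TokenFilter {
  public UserPartFilter(TokenStream in) { super(in); }

  public Token next() throws IOException {
    Token t = input.next();
    if (t == null) return null;
    String text = t.termText();
    int sep = text.indexOf('#');
    if (sep >= 0) t.setTermText(text.substring(0, sep));
    return t;
  }
}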

To add a 'name', I guess the best approach is to use a dynamic field:
  utag_name="erik#lucene"

I'll give this a try.

thanks
ryan

Re: solr tuple/tag store

Pieter Berkel
In reply to this post by Ryan McKinley

MySQL shouldn't really have any problem working with tables of 10M+
rows (especially with simple select queries); most likely the issues
you are experiencing are a result of memory limits set in the MySQL
conf.  If you want to persevere a little longer, try increasing the
values of "innodb_additional_mem_pool_size" and
"innodb_buffer_pool_size" in your my.cnf config file (see
http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html for more
info).
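
For example (the values here are only illustrative - size the buffer
pool to the RAM you can spare):

[mysqld]
# cache for InnoDB data and indexes; the single biggest lever
innodb_buffer_pool_size = 1G
# memory for InnoDB internal structures (a MySQL 5.0-era setting)
innodb_additional_mem_pool_size = 20M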

If there is no compelling reason for sticking with an RDBMS, then the
solr solutions listed above might be a better fit.

Pieter