Lucene suitable? 6 million records / day, 1k


Christian Brennsteiner-3
Hi all,

I am looking for a full-text index that can meet the following requirements:

Index 3,000,000 new records every day, each valid for N days (e.g.
expiring after 90 days).
That is about 34.7 records per second.
One record is, for example, a URL and can be up to 2 KB in size:

http://example.com/somedir/some.html

Lucene should treat "/" as a word separator and should, for example, drop
all ":" characters.

So for the URL
http://example.com/somedir/some.html
the following "sentence" should be indexed:

http example.com somedir some.html

My main concern is that the index should not grow over time: it should
always hold NUMBER OF DAYS * RECORDS PER DAY records and expire each
record after the given time. In my opinion there must be some background
thread that continually throws away expired records.

Is this easily possible with Lucene?

Regards,
Chris

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Lucene suitable? 6 million records / day, 1k

Erick Erickson
Well, I'm reasonably sure you could make this work, although it'll
take some effort.

The 3,000,000 records/day should be pretty easy.

As for parsing the URLs: if none of the supplied tokenizers does exactly
what you want, you can always write your own. Or you can pre-process the
input if that's easier, e.g. replaceAll("[/:]", " "), and then just use
one of the regular analyzers.
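
A minimal sketch of the pre-processing idea, in plain Java with no Lucene dependency. The class and method names are made up for illustration; only the replaceAll pattern comes from the message above. Collapsing the leftover runs of spaces is an extra step added here so the result matches the "sentence" from the original question.

```java
import java.util.Arrays;
import java.util.List;

class UrlPreprocessor {
    // Replace every '/' and ':' with a space, then collapse runs of
    // whitespace so "http://" does not leave three spaces behind.
    static String toSentence(String url) {
        return url.replaceAll("[/:]", " ").trim().replaceAll("\\s+", " ");
    }

    // Split the cleaned-up sentence into the words that would be indexed.
    static List<String> tokens(String url) {
        return Arrays.asList(toSentence(url).split(" "));
    }

    public static void main(String[] args) {
        // prints: http example.com somedir some.html
        System.out.println(toSentence("http://example.com/somedir/some.html"));
    }
}
```

The resulting string could then be handed to a whitespace-based analyzer. Note that dots are kept, so "example.com" and "some.html" stay single tokens, as the question asked.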

You can easily delete records. Assume one of your fields holds the date
at day resolution. Your daemon could then delete by term all records from
90 days ago. Take care to store the date in a convenient form.
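
One possible "convenient form" is a fixed-width yyyyMMdd string, so that each day maps to exactly one term. The sketch below only computes that key using the standard java.time API; the class name, method names, and the field name "day" are assumptions for illustration. The actual delete would be a separate call on the index writer, for example IndexWriter.deleteDocuments(new Term("day", key)), which is left out so the snippet runs without Lucene on the classpath.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class ExpiryTerm {
    // Day-resolution key, e.g. 2008-12-19 -> "20081219".
    static String dayKey(LocalDate day) {
        return day.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // Key of the day whose records expire today, given the retention period.
    static String expiredDayKey(LocalDate today, int retentionDays) {
        return dayKey(today.minusDays(retentionDays));
    }

    public static void main(String[] args) {
        // With 90 days of retention, on 2008-12-19 the daemon would
        // delete every record whose "day" field equals this key.
        // prints: 20080920
        System.out.println(expiredDayKey(LocalDate.of(2008, 12, 19), 90));
    }
}
```

If the daemon can miss a run, it would also need to catch up on earlier days, since deleting a single term only removes that one day.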

Optimizing your index will reclaim all the space from deleted records. It
may take a while to accomplish.


Or you could create a new index every day and use one of the
MultiSearcher kinds of queries. Then you would simply delete the
appropriate index every day. How performant this solution would be
is something I don't have a good feel for, maybe someone else will
chime in.
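
The bookkeeping for the index-per-day variant can be sketched as follows. This is plain Java and does not open or search any index; the "index-yyyyMMdd" naming scheme and all class and method names are assumptions, not anything prescribed by Lucene.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

class DailyIndexRotation {
    // Hypothetical naming scheme: one index directory per day.
    static String indexName(LocalDate day) {
        return "index-" + day.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // The indexes a multi-index searcher would cover: today plus the
    // previous (retentionDays - 1) days, newest first.
    static List<String> liveIndexes(LocalDate today, int retentionDays) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < retentionDays; i++) {
            names.add(indexName(today.minusDays(i)));
        }
        return names;
    }

    // The one index that falls out of the window today and can be deleted.
    static String expiredIndex(LocalDate today, int retentionDays) {
        return indexName(today.minusDays(retentionDays));
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2008, 12, 19);
        System.out.println(liveIndexes(today, 90).size() + " live indexes");
        System.out.println("drop " + expiredIndex(today, 90));
    }
}
```

With this layout, expiry is a directory delete rather than a per-document delete, at the cost of searching across 90 separate indexes.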

All in all, Lucene (or maybe Solr) could work in this scenario. But
this is a significant amount of data, and you'd have to do some testing
to see whether you get acceptable performance.

Best
Erick


On Fri, Dec 19, 2008 at 6:22 AM, Christian Brennsteiner
<[hidden email]>wrote:


Re: Lucene suitable? 6 million records / day, 1k

Aaron Schon
In reply to this post by Christian Brennsteiner-3
Christian,

I do not have an answer for you (I hope some of the gurus on this list can provide an appropriate one).
However, I would ask that you share your findings and experience on this list.

We are facing a similar situation and would appreciate it if you shared what you learn.

Regards
AS





Re: Lucene suitable? 6 million records / day, 1k

Otis Gospodnetic-2
In reply to this post by Christian Brennsteiner-3
Christian,

You can certainly purge old documents on a daily basis to keep the corpus from growing, but note that 3M * 90 = 270M documents of up to 2 KB each may be a bit too much for a single index unless you have a lot of RAM or you don't need queries to be quick. In other words, you may have to spread this over multiple indices/machines.
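
The arithmetic behind these figures can be checked with a few lines of Java. The sketch assumes "2 K" means 2,000 bytes and treats every record as maximum size, so the byte total is an upper bound on the raw input, not a prediction of the on-disk index size (which depends on analysis, stored fields, and term duplication). All names are illustrative.

```java
class CorpusEstimate {
    static final long RECORDS_PER_DAY = 3_000_000L;
    static final int RETENTION_DAYS = 90;
    static final long MAX_RECORD_BYTES = 2_000L; // assumption: "2 K" = 2,000 bytes

    // Steady-state number of live documents: records per day * days kept.
    static long totalDocs() {
        return RECORDS_PER_DAY * RETENTION_DAYS;
    }

    // Upper bound on raw input bytes held at any time.
    static long maxRawBytes() {
        return totalDocs() * MAX_RECORD_BYTES;
    }

    // Average indexing rate needed to keep up, in records per second.
    static double recordsPerSecond() {
        return RECORDS_PER_DAY / 86_400.0;
    }

    public static void main(String[] args) {
        System.out.println(totalDocs() + " documents");            // 270000000
        System.out.println(maxRawBytes() / 1_000_000_000L + " GB"); // 540
        System.out.printf("%.1f records/s%n", recordsPerSecond());
    }
}
```

So the steady-state corpus is 270 million documents and at most roughly 540 GB of raw text, fed at an average of just under 35 records per second.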


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





Re: Lucene suitable? 6 million records / day, 1k

Christian Brennsteiner
Hi Otis,

I think that about 80% of each 2 KB record can be stemmed, and many of
the words are duplicates, so they would not need the full space.
Can you give me an idea of what "don't need queries to be quick" would
mean in your opinion?
I have no idea what query times to expect if the index is not
completely in RAM.

Regards,
Chris



On Mon, Dec 22, 2008 at 4:41 AM, Otis Gospodnetic
<[hidden email]> wrote:




--
---------------
Christian Brennsteiner
Linzergasse 21 / 14
5020 Salzburg
Austria / Europe



Re: Re: Lucene suitable? 6 million records / day, 1k

Tom Roberts LUXONLINE
In reply to this post by Christian Brennsteiner-3
AUTOMATIC REPLY
LUX is closed until 5th January 2009





Re: Lucene suitable? 6 million records / day, 1k

Otis Gospodnetic-2
In reply to this post by Christian Brennsteiner
Hi Christian,

Typically, for public-facing applications, the goal is to return search results in under a second. For some applications, waiting even a minute or more is OK.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----

> From: Christian Brennsteiner <[hidden email]>
> To: [hidden email]
> Sent: Monday, December 22, 2008 2:55:01 AM
> Subject: Re: lucene suiteable ? 6 mio recods / day 1k