[jira] Created: (LUCENE-848) Add supported for Wikipediea English as a corpus in the benchmarker stuff


[jira] Created: (LUCENE-848) Add supported for Wikipediea English as a corpus in the benchmarker stuff

Nick Burch (Jira)
Add supported for Wikipediea English as a corpus in the benchmarker stuff
-------------------------------------------------------------------------

                 Key: LUCENE-848
                 URL: https://issues.apache.org/jira/browse/LUCENE-848
             Project: Lucene - Java
          Issue Type: New Feature
            Reporter: Steven Parkes
         Assigned To: Steven Parkes


Add support for using Wikipedia for benchmarking. If no one is working on this, I'll start soon.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Updated: (LUCENE-848) Add supported for Wikipediea English as a corpus in the benchmarker stuff

Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-848:
---------------------------------

      Component/s: contrib/benchmark
         Priority: Minor  (was: Major)
    Fix Version/s: 2.2

Sorry; it's not a major thing.

> Add supported for Wikipediea English as a corpus in the benchmarker stuff
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Steven Parkes
>            Priority: Minor
>             Fix For: 2.2
>
>
> Add support for using Wikipedia for benchmarking. If no one is working on this, I'll start soon.


[jira] Updated: (LUCENE-848) Add supported for Wikipediea English as a corpus in the benchmarker stuff

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-848:
-------------------------------

    Attachment: WikipediaHarvester.java

There is some code in LUCENE-826. Here is a newer version.

> Add supported for Wikipediea English as a corpus in the benchmarker stuff
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Steven Parkes
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking. If no one is working on this, I'll start soon.


[jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-848:
---------------------------------

      Description: Add support for using Wikipedia for benchmarking.  (was: Add support for using Wikipedia for benchmarking. If no one is working on this, I'll start soon.)
    Lucene Fields:   (was: [New])
          Summary: Add supported for Wikipedia English as a corpus in the benchmarker stuff  (was: Add supported for Wikipediea English as a corpus in the benchmarker stuff)

Can't leave the typo in the title. It's bugging me.

Karl, it looks like your stuff grabs individual articles, right? I'm going to have it download the bzip2 snapshots they provide (and that they prefer you use, if you're getting much).

Question (for Doron and anyone else): the file is XML and it's big, so DOM isn't going to work. I could still use something SAX-based, but since the format is so tightly controlled, I'm thinking regular expressions would be sufficient and have fewer dependencies. Anyone have opinions on this?
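
To make the regular-expression option concrete, here is a minimal sketch. It assumes each article title sits in a single-line <title> element; the class name and sample string are illustrative and not taken from the dump or from any patch on this issue. It shows the trade-off: no parser dependency, but XML escape sequences come back undecoded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTitleSketch {
    // Assumption: titles never span lines and never nest; real dumps may differ.
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    /** Returns the raw text between <title> tags, exactly as it appears in the XML. */
    public static List<String> titles(String xml) {
        List<String> out = new ArrayList<>();
        Matcher m = TITLE.matcher(xml);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<page><title>AT&amp;T</title></page>";
        // The entity is returned as-is ("AT&amp;T"), not decoded to "AT&T".
        System.out.println(titles(xml));
    }
}
```

Any decoding of &amp;, &lt; and friends would have to be added by hand on top of this.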

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Steven Parkes
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.


Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Grant Ingersoll-4

On Mar 28, 2007, at 1:09 PM, Steven Parkes (JIRA) wrote:

> Question (for Doron and anyone else): the file is XML and it's big,
> so DOM isn't going to work. I could still use something SAX-based,
> but since the format is so tightly controlled, I'm thinking regular
> expressions would be sufficient and have fewer dependencies. Anyone
> have opinions on this?


Personally, I think SAX is the way to go, as you'll get handling of  
escape sequences, etc. out of the box.  And seems like it is easier  
to read/maintain????

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/




Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Doron Cohen
Grant Ingersoll <[hidden email]> wrote on 28/03/2007 10:44:08:

>
> On Mar 28, 2007, at 1:09 PM, Steven Parkes (JIRA) wrote:
>
> > Question (for Doron and anyone else): the file is XML and it's big,
> > so DOM isn't going to work. I could still use something SAX-based,
> > but since the format is so tightly controlled, I'm thinking regular
> > expressions would be sufficient and have fewer dependencies. Anyone
> > have opinions on this?
>
>
> Personally, I think SAX is the way to go, as you'll get handling of
> escape sequences, etc. out of the box.  And seems like it is easier
> to read/maintain????

TrecDocMaker relies on the strict structure of the input data: its
read() method "eats" the input stream until it reaches points of
interest, optionally collecting (lines of) text. Depending on the format
here, you may be able to use a variation of this. If the input here is not
that strictly defined, SAX would be better.
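
The "eat until a point of interest" idea can be sketched roughly as follows. This is not TrecDocMaker's code: the readUntil helper, its signature, and the <DOC> markers in the sample are hypothetical and only illustrate the technique on line-oriented input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class EatUntilSketch {
    /**
     * Consumes lines until one starts with the given prefix.
     * If collect is true, the skipped lines are appended to sb.
     * Returns true if the prefix was found, false on end of input.
     */
    public static boolean readUntil(BufferedReader in, String prefix,
                                    StringBuilder sb, boolean collect) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(prefix)) {
                return true;
            }
            if (collect) {
                sb.append(line).append('\n');
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        String data = "junk\n<DOC>\nfirst line\nsecond line\n</DOC>\n";
        BufferedReader in = new BufferedReader(new StringReader(data));
        StringBuilder body = new StringBuilder();
        readUntil(in, "<DOC>", body, false);  // skip to the start marker
        readUntil(in, "</DOC>", body, true);  // collect lines up to the end marker
        System.out.print(body);
    }
}
```

This only works when markers reliably start a line, which is the "strict structure" caveat above.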



RE: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

steven_parkes
In reply to this post by Grant Ingersoll-4
I checked and there are escape sequences in there. If it was ever
debatable, I think that tips it in favor of SAX. xerces? The
contrib/gdata stuff seems to use it.

I suppose if I'm careful and creative enough, we could share a lot of
the code amongst benchmark ingesters that use XML, should there be more
...
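
For comparison with the regular-expression route, a minimal SAX sketch follows. It uses whatever SAX parser the JDK supplies through JAXP rather than a separately bundled Xerces-J; the class name and the <title> sample are illustrative assumptions. The parser decodes escape sequences before the handler sees the text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTitleSketch {
    /** Returns the decoded text of every <title> element in the input. */
    public static List<String> titles(String xml) throws Exception {
        final List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null only while inside <title>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("title".equals(qName)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so accumulate.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("title".equals(qName)) {
                    out.add(current.toString());
                    current = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // The parser hands back "AT&T": the &amp; entity is already decoded.
        System.out.println(titles("<page><title>AT&amp;T</title></page>"));
    }
}
```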


Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Grant Ingersoll-4

On Apr 2, 2007, at 3:41 PM, Steven Parkes wrote:

> I checked and there are escape sequences in there. If it was ever
> debatable, I think that tips it in favor of SAX. xerces? The
> contrib/gdata stuff seems to use it.

Xerces should be fine, I think.

>
> I suppose if I'm careful and creative enough, we could share a lot of
> the code amongst benchmark ingesters that use XML, should there be  
> more
> ...
>

Yes, indeed.  May not be necessary initially, but we could support  
XPath or something down the road to allow us to specify what things  
we are interested in.  I wouldn't worry about generalizing too much  
to start with.  Once we have a couple collections then we can go that  
route.


RE: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

steven_parkes
> Yes, indeed.  May not be necessary initially, but we could support
> XPath or something down the road to allow us to specify what things
> we are interested in.  I wouldn't worry about generalizing too much
> to start with.  Once we have a couple collections then we can go that
> route.
My thoughts, too.

I've been looking at the Reuters stuff. It uncompresses the distribution
and then creates per-article files. I can't decide if I think that's a
good idea for Wikipedia. It's big (about 10G uncompressed) and has about
1.2M files (so I've heard; unverified).

On the one hand, creating separate per-article files is "clean" in that
when you then ingest, you only have disk i/o that's going to affect the
ingest performance (as opposed to, say, uncompressing/parsing). On the
other hand, that's a lot of disk i/o (compresses by about 5X) and a lot
of directory lookups.

Anybody have any opinions/relevant past experience?


Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Marvin Humphrey

On Apr 2, 2007, at 2:50 PM, Steven Parkes wrote:

> On the one hand, creating separate per-article files is "clean" in  
> that
> when you then ingest, you only have disk i/o that's going to affect  
> the
> ingest performance (as opposed to, say, uncompressing/parsing). On the
> other hand, that's a lot of disk i/o (compresses by about 5X) and a  
> lot
> of directory lookups.

One reason I was expanding the elements into individual files was so  
that I could compare different libraries against Lucene, including  
those in other languages.  It was important to measure the engines  
themselves, not SGML parsers.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486837 ]

Karl Wettin commented on LUCENE-848:
------------------------------------

> Karl, it looks like your stuff grabs individual articles, right? I'm going to have it download the bzip2 snapshots they provide (and that they prefer you use, if you're getting much).

They also supply the rendered HTML every now and then. It should be enough to change the URL pattern to file:///tmp/wikipedia/. I was considering porting the MediaWiki BNF as a tokenizer, but found it much simpler to just parse the HTML.

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Steven Parkes
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.


[jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-848:
---------------------------------

    Attachment: LUCENE-848.txt

This patch is a first cut at Wikipedia benchmark support. It downloads the current English pages from the Wikipedia download site ... which, of course, is actually not there right now. I'm not quite sure what's up, but you can find the files at http://download.wikimedia.org/enwiki/20070402/ right now if you want to play.

It adds ExtractWikipedia.java, which uses Xerces-J to grab the individual articles. It writes the articles in the same format as the Reuters stuff, so a genericized ReutersDocMaker, DirDocMaker, works.

The current size of the download file is 2.1G bzip2'd. It's supposed to contain about 1.2M documents but I came out with 2 or 3, I think, so there may be "extra" files in there. (Some entries are links and I tried to get rid of those, but I may have missed a particular coding or case.)

For the first pass, I copied the Reuters steps of decompressing and parsing. This creates big temporary files. Moreover, it creates a big directory tree in the end. (The extractor uses a fixed number of documents per directory and grows the depth of the tree logarithmically, a lot like Lucene segments).
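
One way such a tree could be laid out is sketched below. This is not the layout used by the attached patch: the class, both helper methods, and the ".txt" suffix are hypothetical, and the depth is computed up front from a known document count instead of growing as documents arrive.

```java
public class DirTreeSketch {
    /** Smallest depth such that perDir^depth >= totalDocs (at least 1). */
    public static int depthFor(long totalDocs, int perDir) {
        int depth = 1;
        long capacity = perDir;
        while (capacity < totalDocs) {
            capacity *= perDir;
            depth++;
        }
        return depth;
    }

    /** Relative path for document number doc (0-based) in a tree of the given depth. */
    public static String pathFor(long doc, int perDir, int depth) {
        String[] parts = new String[depth];
        // Peel off base-perDir digits, least significant first.
        for (int i = depth - 1; i >= 0; i--) {
            parts[i] = Long.toString(doc % perDir);
            doc /= perDir;
        }
        // All digits but the last name directories; the last names the file.
        return String.join("/", parts) + ".txt";
    }

    public static void main(String[] args) {
        int perDir = 1000;
        int depth = depthFor(1_200_000L, perDir); // 3 levels for 1.2M documents
        System.out.println(pathFor(1_199_999L, perDir, depth)); // 1/199/999.txt
    }
}
```

With 1000 entries per directory, 1.2M documents need only three levels, which is the logarithmic growth described above.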

It's not clear how this preprocessing-to-a-directory-tree approach compares to on-the-fly decompression, which would require fewer disk seeks on the input during indexing. May try that at some point ...
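
A rough sketch of the on-the-fly alternative: decompress and parse in a single streaming pass, with no temporary files. The JDK ships no bzip2 decoder, so GZIPInputStream stands in here purely for illustration; a bzip2 stream from a third-party library would take its place. The class name and the assumption that each article is a <page> element are likewise illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingIngestSketch {
    /** Counts <page> elements while decompressing; nothing is written to disk. */
    public static int countPages(InputStream compressed) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("page".equals(qName)) {
                    count[0]++; // a real ingester would hand the article to the indexer here
                }
            }
        };
        // GZIP is a stand-in: the actual dump is bzip2, which needs an external library.
        try (InputStream in = new GZIPInputStream(compressed)) {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        }
        return count[0];
    }

    /** Helper for the demo only: gzip a string in memory. */
    public static byte[] gzip(String s) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = gzip("<mediawiki><page/><page/></mediawiki>");
        System.out.println(countPages(new ByteArrayInputStream(data)));
    }
}
```

The cost is that decompression and parsing time land inside the measured ingest, which is the trade-off raised earlier in the thread.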

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Steven Parkes
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: LUCENE-848.txt, WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.


[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487600 ]

Steven Parkes commented on LUCENE-848:
--------------------------------------

By the way, that's a rough patch. I'm cleaning it up as I use it to test 847.

Also, I was going to add support to the algorithm format for setting max field length ...

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Steven Parkes
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: LUCENE-848.txt, WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.


[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487608 ]

Doron Cohen commented on LUCENE-848:
------------------------------------

> Also, I was going to add support to the algorithm format for setting max field length ...

If this means extending the algorithm language, it would be simpler to just base it on a property: in the alg file, set that property ("max.field.length=20000"), and then in OpenIndexTask read the new property (see how the merge.factor property is read) and set it on the index.
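
The property-based approach might look roughly like this. The sketch uses plain java.util.Properties as a stand-in for the benchmark's own configuration handling; the class name and the fallback value are hypothetical, and the step that applies the value to the index is left out.

```java
import java.io.StringReader;
import java.util.Properties;

public class MaxFieldLengthSketch {
    // Hypothetical fallback; the thread does not say what the default should be.
    static final int DEFAULT_MAX_FIELD_LENGTH = 10000;

    /** Reads max.field.length, falling back to the default when it is not set. */
    public static int maxFieldLength(Properties props) {
        String raw = props.getProperty("max.field.length");
        return raw == null ? DEFAULT_MAX_FIELD_LENGTH : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the property lines at the top of an alg file.
        Properties props = new Properties();
        props.load(new StringReader("merge.factor=10\nmax.field.length=20000\n"));
        System.out.println(maxFieldLength(props)); // prints 20000
    }
}
```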


> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Steven Parkes
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: LUCENE-848.txt, WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.


[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487609 ]

Steven Parkes commented on LUCENE-848:
--------------------------------------

That's what I meant (and did).

If it's okay, I'll bundle it into 848.



> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Steven Parkes
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: LUCENE-848.txt, WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.


[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487617 ]

Doron Cohen commented on LUCENE-848:
------------------------------------

Seems okay to me (since it's all in the benchmark).

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Steven Parkes
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: LUCENE-848.txt, WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.


[jira] Assigned: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned LUCENE-848:
--------------------------------------

    Assignee: Grant Ingersoll  (was: Steven Parkes)

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: LUCENE-848.txt, WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.


[jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-848:
---------------------------------

    Attachment: LUCENE-848.txt

Update of the previous patch. Used Doron's suggestion for the variable name. Cleaned up a little (reverted the eol style on build.txt so the diff makes sense; see LUCENE-864 for fixing the eol-styles in contrib/benchmark).

Right now the test algorithm is wikipedia.alg, but I think the idea is to create specific benchmarks, so maybe this should be named something like ingest-enwiki, meaning a test of ingest rate against Wikipedia.

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: LUCENE-848.txt, LUCENE-848.txt, WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.


[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12489283 ]

Steven Parkes commented on LUCENE-848:
--------------------------------------

Blah. This patch doesn't work quite right with Java 1.4. My intention was/is to use Xerces to do the XML parsing, but the setup doesn't work quite right under 1.4, which has some Crimson stuff in rt.jar that I don't (yet) understand.
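
For readers following along, here is a minimal sketch of one way to ask for a specific SAX implementation instead of whatever the JDK bundles. It is not taken from the attached patch. The class name WikiTitleSketch, the helper names, and the tiny inline XML are made up for illustration, and the Xerces factory class name is an assumption about what the Xerces jar provides. The two-argument SAXParserFactory.newInstance call only exists from Java 6 on, so this would not run on 1.4 itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class, for illustration only; not part of the LUCENE-848 patch.
class WikiTitleSketch {
    // Assumed Xerces JAXP factory; only loadable when the Xerces jar is on the classpath.
    static final String XERCES_FACTORY = "org.apache.xerces.jaxp.SAXParserFactoryImpl";

    // Prefer the named factory, fall back to the JDK's default if it cannot be loaded.
    static SAXParserFactory newFactory() {
        try {
            return SAXParserFactory.newInstance(XERCES_FACTORY, null);
        } catch (FactoryConfigurationError e) {
            return SAXParserFactory.newInstance();
        }
    }

    // Collects the text of every <title> element in the given XML string.
    static List<String> titles(String xml) {
        final List<String> titles = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <title>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("title".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("title".equals(qName)) {
                    titles.add(text.toString());
                    text = null;
                }
            }
        };
        try {
            newFactory().newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new RuntimeException(e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><page><title>Apache Lucene</title></page>"
                   + "<page><title>Xerces</title></page></mediawiki>";
        System.out.println(titles(xml));
    }
}
```

Naming the factory explicitly keeps the choice of parser out of the hands of the JDK's lookup order, which is the part that appears to differ between Java versions here.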


[jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-848:
---------------------------------

    Attachment: LUCENE-848.txt

Okay, I've tested this patch against Java 1.4, 1.5, and 1.6. I've added the Xerces lib since we're already including other required support jars in lib.

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt, WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.
