[jira] Created: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
Arc File to Nutch Segments Converter
------------------------------------

                 Key: NUTCH-565
                 URL: https://issues.apache.org/jira/browse/NUTCH-565
             Project: Nutch
          Issue Type: Improvement
         Environment: all
            Reporter: Dennis Kubes
            Assignee: Dennis Kubes
             Fix For: 1.0.0


Functionality that allows arc files, such as those produced by the Internet Archive project or by the Grub distributed crawler, to be parsed into Nutch segments.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-565:
-------------------------------

    Attachment: archive-commons-1.11.0-200612262257.jar

Archive commons jar needed for reading arc files.

[jira] Updated: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-565:
-------------------------------

    Attachment: nutch-565-1-20071009.patch

An arc file input format, record reader, and utility to convert arc files to nutch segments.  The conversion utility acts in place of the fetcher to convert compressed web pages in arc files into the standard nutch segments format.  All current fetcher rules for url filtering and normalization as well as content parsing still apply.  Currently only text/html content types are supported within the arc files.  This functionality is meant to be used with hadoop-0.14 or higher.
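
As a rough illustration of the content-type restriction described above, the sketch below parses an ARC record header line and keeps only text/html records. It is written in Python for brevity and is not taken from the patch, which is Java. It assumes the five space-separated fields of the ARC v1 URL-record layout (URL, IP address, archive date, content type, length) and a content type without parameters; the names ArcHeader, parse_arc_header and is_supported are hypothetical.

```python
from typing import NamedTuple, Optional


class ArcHeader(NamedTuple):
    """Fields of one ARC v1 URL-record header line (assumed layout)."""
    url: str
    ip: str
    date: str
    content_type: str
    length: int


def parse_arc_header(line: str) -> Optional[ArcHeader]:
    """Parse 'url ip date content-type length'; return None if malformed."""
    parts = line.strip().split(" ")
    if len(parts) != 5:
        return None
    url, ip, date, content_type, length = parts
    if not length.isdigit():
        return None
    return ArcHeader(url, ip, date, content_type.lower(), int(length))


def is_supported(header: ArcHeader) -> bool:
    """Mirror the stated restriction: only text/html records are converted."""
    return header.content_type == "text/html"


if __name__ == "__main__":
    sample = "http://example.com/ 192.0.2.1 20071009120000 text/html 1234"
    header = parse_arc_header(sample)
    if header is not None and is_supported(header):
        print("would convert", header.url)
```

Records that fail the check would simply be skipped before URL filtering, normalization and parsing are applied.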

[jira] Updated: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-565:
-------------------------------

    Attachment: fastutil-5.0.3-heritrix-subset-1.0.jar

Also requires some fastutil classes.

[jira] Commented: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533458 ]

Sami Siren commented on NUTCH-565:
----------------------------------

What are the licenses for those jars?

[jira] Commented: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533498 ]

Dennis Kubes commented on NUTCH-565:
------------------------------------

Both jars are LGPL.  The archive-commons is from archive.org and is currently used in NutchWax.  The fastutil jar is a subset of fastutil classes used by archive.org.

[jira] Commented: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533499 ]

Dennis Kubes commented on NUTCH-565:
------------------------------------

Currently the input format uses 1 map task per arc file.  This could be improved in the future by breaking a file into multiple map tasks.
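
To make the "1 map task per arc file" behavior concrete, here is a minimal Python sketch of split planning in which every file becomes exactly one unsplit input split. It is only an illustration, not the patch's Java input format; the Split type and one_split_per_file function are hypothetical names.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Split(NamedTuple):
    """One unit of work handed to a single map task."""
    path: str
    start: int
    length: int


def one_split_per_file(files: Iterable[Tuple[str, int]]) -> List[Split]:
    """Plan one split covering each whole file, so each file gets one map task."""
    return [Split(path, 0, size) for path, size in files]


if __name__ == "__main__":
    for split in one_split_per_file([("a.arc.gz", 100), ("b.arc.gz", 250)]):
        print(split)
```

Breaking a file into multiple map tasks would mean emitting several splits per file with different start offsets, which in turn requires each task to find a record boundary near its offset.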

[jira] Commented: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533787 ]

Sami Siren commented on NUTCH-565:
----------------------------------

bq. Both jars are LGPL.
I think that prohibits direct inclusion then. Take a look at http://people.apache.org/~rubys/3party.html for available options.

Re: [jira] Commented: (NUTCH-565) Arc File to Nutch Segments Converter

Dennis Kubes-2
Does this mean that they can't be attached to JIRA or that they can't be
included in the repository or both?

Dennis Kubes

Sami Siren (JIRA) wrote:

>     [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533787 ]
>
> Sami Siren commented on NUTCH-565:
> ----------------------------------
>
> bq. Both jars are LGPL.
> I think that prohibits direct inclusion then. Take a look at http://people.apache.org/~rubys/3party.html for available options.

Re: [jira] Commented: (NUTCH-565) Arc File to Nutch Segments Converter

Sami Siren-2
Dennis Kubes wrote:
> Does this mean that they can't be attached to JIRA or that they can't be
> included in the repository or both?

My understanding is that they cannot be made part of the standard/default
release. In addition to that, it also says "YOU MUST NOT distribute a
prohibited work from an apache.org server." I _think_ that also covers JIRA?

There seems to be an option to add a non default build option that
retrieves such libraries (system requirement) with some additional
restrictions.


--
 Sami Siren

[jira] Updated: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-565:
-------------------------------

    Attachment: arcsegments2.patch

Here is the updated patch.  It works without any archive.org or other LGPL code, so it can be included in Nutch.  Since arcs are simply tars of gzips, it scans through the arc file for the gzip header, starts input there when one is found, and unzips each record in turn.  It takes about 40 min to process a single file, which outputs ~1G in segments.  Multiple files can be run at once on a Hadoop cluster.
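
The scanning approach described above can be sketched roughly as follows. This is a Python illustration, not the Java code in the patch, and it assumes the arc file is a plain concatenation of gzip members, one per record. The function name iter_gzip_members and the choice to match the three bytes 1f 8b 08 (gzip ID bytes plus the deflate method byte) are assumptions made for the example.

```python
import gzip
import zlib
from typing import Iterator

# gzip ID1, ID2, and the compression-method byte for deflate
GZIP_MAGIC = b"\x1f\x8b\x08"


def iter_gzip_members(data: bytes) -> Iterator[bytes]:
    """Yield the decompressed payload of each gzip member found in data."""
    pos = data.find(GZIP_MAGIC)
    while pos != -1:
        # wbits = 16 + MAX_WBITS tells zlib to expect a gzip wrapper
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        try:
            payload = decomp.decompress(data[pos:])
        except zlib.error:
            # The magic bytes were a false positive; keep scanning.
            pos = data.find(GZIP_MAGIC, pos + 1)
            continue
        if not decomp.eof:
            # Truncated member: no complete record starts here.
            pos = data.find(GZIP_MAGIC, pos + 1)
            continue
        yield payload
        # Bytes left over after this member belong to later records.
        consumed = len(data) - pos - len(decomp.unused_data)
        pos = data.find(GZIP_MAGIC, pos + consumed)


if __name__ == "__main__":
    blob = b"junk" + gzip.compress(b"record one") + gzip.compress(b"record two")
    for record in iter_gzip_members(blob):
        print(record)
```

Reading the whole file into memory keeps the sketch short; a real reader would stream from the input and decompress incrementally.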

[jira] Updated: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-565:
-------------------------------

    Attachment:     (was: fastutil-5.0.3-heritrix-subset-1.0.jar)

[jira] Updated: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-565:
-------------------------------

    Attachment:     (was: archive-commons-1.11.0-200612262257.jar)

[jira] Commented: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534364 ]

Sami Siren commented on NUTCH-565:
----------------------------------

I didn't actually test this, but it looks like a useful addition to Nutch, so +1 from me.

[jira] Commented: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535772 ]

Dennis Kubes commented on NUTCH-565:
------------------------------------

If nobody has a problem with this, I am going to commit it.

[jira] Commented: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535943 ]

Andrzej Bialecki  commented on NUTCH-565:
-----------------------------------------

+1 overall. One question: shouldn't we put this under org.apache.nutch.tools.arc instead of creating a new top level Nutch package?

[jira] Commented: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535960 ]

Dennis Kubes commented on NUTCH-565:
------------------------------------

Yeah, I didn't really know where to put this, as it is a tool but it is also a replacement for the fetcher.  If it is best under tools we can put it there.

[jira] Updated: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-565:
-------------------------------

    Attachment: NUTCH-565-3.patch

Updated patch with javadoc and code documentation.  This patch relies on upgrading to hadoop-0.15.

[jira] Updated: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-565:
-------------------------------

    Attachment: NUTCH-565-3.patch

Let's try this again, this time with clicking the right button to grant the license to apache.  Previous file removed and re-added.

[jira] Updated: (NUTCH-565) Arc File to Nutch Segments Converter

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-565:
-------------------------------

    Attachment:     (was: NUTCH-565-3.patch)
