Writing Nutch data in Parquet format


Writing Nutch data in Parquet format

lewis john mcgibbney-2
Hi user@,
Has anyone experimented/accomplished either
1) writing Nutch data directly as Parquet format, or
2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet format?
Thank you
lewismc

Re: Writing Nutch data in Parquet format

Sebastian Nagel-2
Hi Lewis,

 > 2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet format?

Yes, but not directly - it's a multi-step process. The outcome:
   https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

This Parquet index is optimized by sorting the rows by a special form of the URL [1] which
- drops the protocol or scheme
- reverses the host name and
- puts it in front of the remaining URL parts (path and query)
- with some additional normalization of path and query (e.g. sorting of query params)

One example:
   https://example.com/path/search?q=foo&l=en
   com,example)/path/search?l=en&q=foo

The SURT URL is similar to the URL format used by Nutch2
   com.example/https/path/search?q=foo&l=en
to address rows in the WebPage table [2]. This format is inspired by the BigTable
paper [3]. The point is that rows for pages of the same host and domain end up close to each other, cf. [4].
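
For illustration, a rough sketch of this transformation in Scala - simplified, of course;
the real surt library [1] handles many more normalization rules and edge cases:

   import java.net.URL

   // Simplified sketch of the SURT-style sort key described above:
   // drop the scheme, reverse the host name labels, and append the
   // path plus the query with its parameters sorted.
   def surtKey(urlString: String): String = {
     val url = new URL(urlString)
     val reversedHost = url.getHost.split('.').reverse.mkString(",")
     val sortedQuery = Option(url.getQuery)
       .map(q => "?" + q.split('&').sorted.mkString("&"))
       .getOrElse("")
     reversedHost + ")" + url.getPath + sortedQuery
   }

   // surtKey("https://example.com/path/search?q=foo&l=en")
   //   ==> "com,example)/path/search?l=en&q=foo"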


Ok, back to the question: both 1) and 2) are trivial if you do not care about
writing optimal Parquet files: just define a schema following the methods of the
classes implementing the Writable interface. Parquet is easier to feed into various
data processing systems because it embeds the schema. The Sequence file format requires
that the Writable classes are provided - although Spark and other big data tools support
Sequence files, this requirement is sometimes a blocker, also because Nutch
does not ship a small "nutch-formats" jar.
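
As a rough sketch of 2) - just to illustrate the idea, with placeholder paths, a
hand-picked subset of CrawlDatum fields, and assuming the Nutch job jar is on the
Spark classpath - converting the CrawlDb could look like:

   import org.apache.hadoop.io.Text
   import org.apache.nutch.crawl.CrawlDatum
   import org.apache.spark.sql.SparkSession

   object CrawlDbToParquet {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("CrawlDbToParquet").getOrCreate()
       import spark.implicits._

       // The CrawlDb "current" directory holds MapFiles whose data files are
       // SequenceFiles of (Text url, CrawlDatum). Paths are placeholders.
       val crawlDb = spark.sparkContext.sequenceFile(
         "hdfs:///path/to/crawldb/current/part-*/data",
         classOf[Text], classOf[CrawlDatum])

       // Hadoop reuses the Writable objects while reading, so copy the fields
       // into plain types before handing the records to Spark SQL.
       val rows = crawlDb
         .map { case (url, datum) =>
           (url.toString, datum.getStatus.toInt, datum.getFetchTime, datum.getScore)
         }
         .toDF("url", "status", "fetch_time", "score")

       rows.write.parquet("hdfs:///path/to/crawldb-parquet")
       spark.stop()
     }
   }

To get the locality benefits described above you would additionally sort the rows
by a key like the one sketched before writing.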

Nevertheless, the price for Parquet is slower writing - which is ok for write-once-read-many
use cases. But the typical use case for Nutch is "write-once-read-twice":
- segment: read for CrawlDb update and indexing
- CrawlDb: read during update then replace, in some cycles read for deduplication, statistics, etc.


Lewis, I'd be really interested to hear what your particular use case is.

Also because at Common Crawl we plan to provide more data in the Parquet format: page metadata,
links and text dumps. Storing URLs and web page metadata efficiently was part of the motivation
for Dremel [5], which in turn inspired Parquet [6].


Best,
Sebastian


[1] https://github.com/internetarchive/surt
[2] https://cwiki.apache.org/confluence/display/NUTCH/Nutch2Crawling#Nutch2Crawling-Introduction
[3] https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
[4] https://cloud.google.com/bigtable/docs/schema-design#domain-names
[5] https://research.google/pubs/pub36632/
[6] https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html


On 5/4/21 11:14 PM, Lewis John McGibbney wrote:
> Hi user@,
> Has anyone experimented/accomplished either
> 1) writing Nutch data directly as Parquet format, or
> 2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet format?
> Thank you
> lewismc
>


Re: Writing Nutch data in Parquet format

lewis john mcgibbney-2
Hi Seb,
Really interesting. Thanks for the response. Below....

On 2021/05/05 11:42:04, Sebastian Nagel <[hidden email]> wrote:
>
> Yes, but not directly - it's a multi-step process.

As I expected ;)

>
> This Parquet index is optimized by sorting the rows by a special form of the URL [1] which
> - drops the protocol or scheme
> - reverses the host name and
> - puts it in front of the remaining URL parts (path and query)
> - with some additional normalization of path and query (eg. sorting of query params)
>
> One example:
>    https://example.com/path/search?q=foo&l=en
>    com,example)/path/search?l=en&q=foo
>
> The SURT URL is similar to the URL format used by Nutch2
>    com.example/https/path/search?q=foo&l=en
> to address rows in the WebPage table [2]. This format is inspired by the BigTable
> paper [3]. The point is that rows for pages of the same host and domain end up close to each other, cf. [4].

OK, I recognize this data model. Seems logical.

> Ok, back to the question: both 1) and 2) are trivial if you do not care about
> writing an optimal Parquet files: just define a schema following the methods implementing
> the Writable interface. Parquet is easier to feed into various data processing systems
> because it integrates the schema. The Sequence file format requires that the
> Writable formats are provided - although Spark and other big data tools support
> Sequence files this requirement is sometimes a blocker, also because Nutch
> does not ship a small "nutch-formats" jar.

In my case, the purpose of writing Nutch (Hadoop sequence file) data to Parquet format was to facilitate (improved) analytics within the Databricks platform, which we are currently evaluating.
I'm hesitant to re-use the word 'optimal' because I have not yet benchmarked any retrievals, but I 'hope' to begin working on 'optimizing' the way that Nutch data is written such that it can be analyzed with relative ease within, for example, Databricks.

>
> Nevertheless, the price for Parquet is slower writing - which is ok for write-once-read-many
> use cases.

Yes, this is our use case.

> But the typical use case for Nutch is "write-once-read-twice":
> - segment: read for CrawlDb update and indexing
> - CrawlDb: read during update then replace, in some cycles read for deduplication, statistics, etc.

So sequence files are optimal for use within the Nutch system, but for additional analytics (on outside platforms such as Databricks) I suspect that Parquet would be preferred.

Maybe we can share more ideas. I wonder if a utility tool to write segments as Parquet data would be useful?
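
Something along these lines is what I have in mind - just a sketch with made-up paths
and a minimal schema, e.g. dumping a segment's parse_text as Parquet:

   import org.apache.hadoop.io.Text
   import org.apache.nutch.parse.ParseText
   import org.apache.spark.sql.SparkSession

   object SegmentTextToParquet {
     def main(args: Array[String]): Unit = {
       // args: segment directory and output directory (placeholders, not a final CLI)
       val Array(segmentDir, outputDir) = args
       val spark = SparkSession.builder().appName("SegmentTextToParquet").getOrCreate()
       import spark.implicits._

       // parse_text holds MapFiles of (Text url, ParseText); read their data files.
       val parseText = spark.sparkContext.sequenceFile(
         s"$segmentDir/parse_text/part-*/data", classOf[Text], classOf[ParseText])

       // Copy the Writables into plain Strings before writing.
       parseText
         .map { case (url, text) => (url.toString, text.getText) }
         .toDF("url", "text")
         .write.parquet(outputDir)

       spark.stop()
     }
   }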

Thanks Seb