Content truncated while using commoncrawldump


I am currently attempting to dump the contents of a crawl into multiple
WARC files using

./bin/nutch commoncrawldump -outputDir nameOfOutputDir -segment crawl/segments/segmentDir -warc

However, I get multiple occurrences of

URL skipped. Content of size X was truncated to Y.

I have set both http.content.limit and file.content.limit to -1 in order
to remove any limits, but I'm guessing neither applies to this
situation. Is there any way of removing this cap?
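For reference, this is how I set those two limits in conf/nutch-site.xml (the property names come from nutch-default.xml; -1 is the documented value for "no limit"):

```xml
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- -1 removes the per-document size cap for content fetched over HTTP -->
    <value>-1</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <!-- -1 removes the size cap for content fetched via the file:// protocol -->
    <value>-1</value>
  </property>
</configuration>
```

As far as I understand, these limits apply at fetch time, so segments crawled before the change would still contain the truncated content.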