Urlfilter bug (doesn't return on long URLs)

Rod Taylor-2
I added a few log statements to ParseOutputFormat.java: one after
'String toUrl =' and another before the 'if (toUrl != null)'. Nutch came
across a URL that hit the first but not the second.

This means it is getting stuck between the two: there is no exit or error,
and eventually the process times out and is reattempted, only to fail in
exactly the same way.

The URL it is trying to process at the time is very long and somewhat
convoluted. The thread is idle. Adding a restriction to skip URLs longer
than 512 characters seems to have solved it.

The URL (4096 characters long):
http://www.moveandstay.com/aberdeen/::abilene/::addison/::adelaide/::1076::1042::aix_en_provence/::alexandria/::algarve/::alpharetta/::1077::amalfi_coast/::amersham/::amsterdam/::arlington/::ashgrove/::atlanta/::1080::auckland/::austin/::707::bali/::1102::bangalore/::bangkok/::1037::barcelona/::beachwood/::bedminster/::beijing/::bellevue/::belo_horizonte/::berlin/::bethesda/::beverly_hills/::1068::1082::birmingham/::birmingham/::blois/::bloomfield_hills/::boca_raton/::bogota/::bohemia/::960::bonn/::bordeaux/::boston/::bothell/::brasilia/::1145::brest/::bridgewater/::brisbane/::bristol/::brookfield/::broomfield/::brussels/::budapest/::buffalo/::burlington/::cairns/::cambridge/::cambridge/::campbell/::campinas/::canberra/::cape_town/::1040::caracas/::cardiff/::1114::carlsbad/::carlton/::century_city/::cerritos/::1061::charlotte/::cheltenham/::1016::chicago/::chonburi/::christchurch/::308::cincinnati/::cleveland/::cologne/::compiegne/::1079::coral_gables/::costa_mesa/::crete/::culver_city/::curitiba/::1064::1098::1166::dallas/::dandenong/::darwin/::denver/::1063::doncaster/::dortmund/::dubai/::dublin/::dublin/::durham/::195::east_brunswick/::east_sicily/::edina/::edinburgh/::englewood/::erlanger/::essen/::fairfax/::farmington/::fitzroy/::florence/::1090::framingham/::frankfurt/::freehold/::frisco/::1127::979::glasgow/::glendale/::1133::gold_river/::1084::greenwood_village/::1091::guadalajara/::guangzhou/::1170::hamburg/::hanoi/::1132::hauppage/::henderson/::ho_chi_minh_city/::hobart/::hongkong/::houston/::huntington_beach/::1089::independence/::indianapolis/::1059::irvine/::irvine/::irving/::iselin/::671::162::jacksonville/::jakarta/::1113::jersey_city/::johannesburg/::jolimont/::kennesaw/::kew/::king_of_prussia/::kirkland/::krabi/::kuala_lumpur/::673::1185::la_jolla/::la_mirada/::1085::la_rochelle/::lago_maggiore/::laguna_hills/::1144::lake_oswego/::lannion/::1087::1159::las_vegas/::le_mans/::leeds/::lille/::lisbon/::lisle/::london/::long_beach/::los_angeles/::lyon/::
1143::1021::963::madrid/::mahwah/::maidenhead/::1067::maitland/::1088::1025::manchester/::1081::mandurah/::manhattan_beach/::manila/::1078::732::1044::1105::marseille/::mclean/::melbourne/::melville/::mexico_city/::miami/::michigan/::milan/::458::minneapolis/::minnetonka/::monterrey/::montpellier/::montreal/::morristown/::1130::686::mountain_view/::mt._laurel/::mumbai/::munich/::nagoya/::nancy/::nantes/::naples/::narre_warren/::nashville/::new_delhi/::new_york/::newark/::newcastle/::newport_beach/::newtown/::199::norcross/::northbrook/::nottingham/::novi/::191::oak_brook/::oakbrook_terrace/::orange/::orlando/::osaka/::1186::overland_park/::131::palatine/::paris/::parnell/::parsippany/::pasadena/::pattaya/::1060::perth/::1120::philadelphia/::phoenix/::phuket/::pittsburgh/::plantation/::pleasanton/::ponsoby/::portland/::porto_alegre/::1123::positano/::prague/::prahran/::1106::princeton/::1058::puglia/::rancho_santa_margarita/::rayong/::reading/::red_bank/::redmond/::rennes/::reston/::rio_de_janeiro/::693::rolling_meadows/::rome/::rosemont/::roseville/::sacramento/::saddle_brook/::1072::saint-nazaire/::1083::salvador/::1115::1029::san_antonio/::san_diego/::san_francisco/::san_jose/::san_juan/::san_mateo/::san_rafael/::san_ramon/::santa_clara/::564::sao_polo/::1049::1062::1118::schaumburg/::scottsdale/::570::seattle/::seoul/::shanghai/::1160::short_hills/::1108::singapore/::sofia/::sophia_antipolis/::sorrento/::1117::579::southfield/::1066::st_kilda/::st_louis/::stockholm/::strasbourg/::192::sun_city/::sunrise/::surabaya/::sydney/::syosset/::1126::1158::tampa/::tarrytown/::taupo/::the_entrance/::tokyo/::toronto/::1069::toulouse/::trinity_beach/::troy/::tulsa/::tuscany_cities/::1055::tuscany_seaside/::tustin/::1134::umbria/::uniondale/::vancouver/::venice/::verona/::victoria/::vienna/::vienna/::1111::1110::walnut_creek/::waltham/::wantirna/::warrenville/::warsaw/::washington_dc/::1128::wellesley_hills/::wellington/::west_chester/::west_sicily/::white_plains/::wiesbaden/:
:williamstown/::windsor/::woodland_hills/::worthington/::948::zurich/


Index: ParseOutputFormat.java
===================================================================
--- ParseOutputFormat.java (revision 344015)
+++ ParseOutputFormat.java (working copy)
@@ -56,7 +56,7 @@
 
         public void write(WritableComparable key, Writable value)
           throws IOException {
-          
+
           Parse parse = (Parse)value;
           
           textOut.append(key, new ParseText(parse.getText()));
@@ -73,6 +73,10 @@
           for (int i = 0; i < links.length; i++) {
             String toUrl = links[i].getToUrl();
             try {
+              if (toUrl.length() > 512) {
+                 throw new Exception("URL length too long: " + toUrl.length() +" characters");
+              }
+
               toUrl = urlNormalizer.normalize(toUrl); // normalize the url
               toUrl = URLFilters.filter(toUrl);   // filter the url
             } catch (Exception e) {
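
For anyone who wants to try the same guard outside of Nutch, here is a minimal standalone sketch of the idea in the patch above: drop links whose URL exceeds a length cutoff before they reach normalization or filtering. The class name UrlLengthGuard, its method names, and the choice to return null to mean "skip" are assumptions made for illustration, not Nutch code; the 512 limit is the one from the patch.

```java
import java.util.ArrayList;
import java.util.List;

class UrlLengthGuard {
    // Cutoff taken from the patch; the right value for a given crawl is a judgment call.
    static final int MAX_URL_LENGTH = 512;

    /** Returns the URL if it is short enough, or null to signal "skip this link". */
    static String guard(String url) {
        if (url == null || url.length() > MAX_URL_LENGTH) {
            return null;
        }
        return url;
    }

    /** Keeps only the URLs that pass the guard, preserving their order. */
    static List<String> keepShort(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (guard(url) != null) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Build an artificial URL that is longer than the cutoff.
        StringBuilder longUrl = new StringBuilder("http://www.moveandstay.com/");
        while (longUrl.length() <= MAX_URL_LENGTH) {
            longUrl.append("aberdeen/::");
        }
        List<String> links = new ArrayList<>();
        links.add("http://www.moveandstay.com/aberdeen/");
        links.add(longUrl.toString());
        // Only the short URL survives the guard.
        System.out.println(keepShort(links));
    }
}
```

Returning null instead of throwing is just one option; the sketch also treats a null URL as something to skip, which the patch does not check for explicitly.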

--
Rod Taylor <[hidden email]>


Re: Urlfilter bug (doesn't return on long URLs)

Doug Cutting-2
This sounds like a bug in the URLFilter implementation.  Is this
RegexURLFilter?  Can you figure out what regex is causing this?
Probably the patch should be there, no?
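
One way to answer the "which regex?" question is to time each rule against the problem URL on a worker thread and give up after a timeout. The sketch below assumes Java 8 or later. RegexTimer, timeRegex, and the sample patterns are made up for illustration and are not taken from Nutch or its regex-urlfilter.txt. Because a running java.util.regex match does not check for interruption on its own, the input is wrapped in a CharSequence whose charAt checks the thread's interrupt flag.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

class RegexTimer {
    /** CharSequence wrapper that aborts matching once the thread is interrupted. */
    static final class InterruptibleCharSequence implements CharSequence {
        private final CharSequence inner;

        InterruptibleCharSequence(CharSequence inner) {
            this.inner = inner;
        }

        public char charAt(int index) {
            if (Thread.currentThread().isInterrupted()) {
                throw new RuntimeException("regex matching interrupted");
            }
            return inner.charAt(index);
        }

        public int length() {
            return inner.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new InterruptibleCharSequence(inner.subSequence(start, end));
        }

        public String toString() {
            return inner.toString();
        }
    }

    /** Returns elapsed milliseconds, or -1 if the match did not finish within timeoutMs. */
    static long timeRegex(String regex, String input, long timeoutMs) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Pattern pattern = Pattern.compile(regex);
            Callable<Boolean> task =
                    () -> pattern.matcher(new InterruptibleCharSequence(input)).find();
            long start = System.nanoTime();
            Future<Boolean> result = pool.submit(task);
            try {
                result.get(timeoutMs, TimeUnit.MILLISECONDS);
                return (System.nanoTime() - start) / 1_000_000;
            } catch (TimeoutException e) {
                result.cancel(true); // interrupts the worker so charAt throws
                return -1;
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        String url = args.length > 0 ? args[0]
                : "http://www.moveandstay.com/aberdeen/::abilene/::addison/";
        // Placeholder rules; substitute the patterns from your own filter file.
        String[] rules = { "\\.(gif|jpg|png)$", "[?*!@=]", "^http://" };
        for (String rule : rules) {
            long ms = timeRegex(rule, url, 2000);
            System.out.println(rule + " -> " + (ms < 0 ? "TIMED OUT" : ms + " ms"));
        }
    }
}
```

Passing the full URL as the first argument and the real rules in place of the placeholders would show whether any single pattern fails to return.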

Doug


Re: Urlfilter bug (doesn't return on long URLs)

Rod Taylor-2
On Mon, 2005-11-21 at 15:11 -0800, Doug Cutting wrote:
> This sounds like a bug in the URLFilter implementation.  Is this
> RegexURLFilter?  Can you figure out what regex is causing this?
> Probably the patch should be there, no?

I am using the URL filtering and normalization plugins. As for where the
patch should go, I didn't dig any deeper than ParseOutputFormat.java, so
that is where I applied it on my own system to stop it from breaking.

I put the URL into a file and tried using it as a seed for nutch crawl.
The lockup occurred.

I commented out all entries in regex-urlfilter.txt and
crawl-urlfilter.txt except for the "+." line at the end. The
lockup/timeout still occurred. Commenting out the full contents of
regex-normalize.xml does not change the outcome either.
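
Since disabling the regex rules and the normalizer rules did not change anything, it may help to call each stage directly, outside of a crawl, and see which one fails to return. A rough sketch follows, assuming Java 8 or later. StageProbe and finishesInTime are hypothetical names, and the two stages in main are trivial stand-ins; in a real test they would be replaced with calls into the normalizer and filter from your own Nutch build.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.UnaryOperator;

class StageProbe {
    /** Runs one stage on a daemon thread; true if it returned (or threw) within timeoutMs. */
    static boolean finishesInTime(UnaryOperator<String> stage, String url, long timeoutMs)
            throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "stage-probe");
            t.setDaemon(true); // a stuck stage must not keep the JVM alive
            return t;
        });
        Callable<String> task = () -> stage.apply(url);
        Future<String> result = pool.submit(task);
        try {
            result.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (ExecutionException e) {
            return true; // the stage threw, which still counts as returning
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in stages; replace with the real normalizer and filter calls.
        Map<String, UnaryOperator<String>> stages = new LinkedHashMap<>();
        stages.put("normalize", String::trim);
        stages.put("filter", u -> u.startsWith("http") ? u : null);

        String url = args.length > 0 ? args[0]
                : "http://www.moveandstay.com/aberdeen/::abilene/::addison/";
        for (Map.Entry<String, UnaryOperator<String>> e : stages.entrySet()) {
            boolean ok = finishesInTime(e.getValue(), url, 2000);
            System.out.println(e.getKey() + ": " + (ok ? "returned" : "did NOT return"));
        }
    }
}
```

The worker is a daemon thread so that a stage stuck in an endless loop cannot keep the JVM from exiting once the probe has reported.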

Attached is the seed file I used; it was in the seed directory.

-bash-2.05b$ ./bin/nutch crawl /home/rbt/nutch-0.8_10/test/seed/ -dir /home/rbt/nutch-0.8_10/test/test3 -topN 1
051122 120048 parsing file:/home/rbt/nutch-0.8_10/conf/nutch-default.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/crawl-tool.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/mapred-default.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/nutch-site.xml
051122 120049 crawl started in: /home/rbt/nutch-0.8_10/test/test3
051122 120049 rootUrlFile = /home/rbt/nutch-0.8_10/test/seed
051122 120049 threads = 45
051122 120049 depth = 5
051122 120049 topN = 1
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/nutch-default.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/crawl-tool.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/nutch-site.xml
051122 120049 Injector: starting
051122 120049 Injector: crawlDb: /home/rbt/nutch-0.8_10/test/test3/crawldb
051122 120049 Injector: urlDir: /home/rbt/nutch-0.8_10/test/seed
051122 120049 Injector: Converting injected urls to crawl db entries.
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/nutch-default.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/crawl-tool.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/mapred-default.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/mapred-default.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/nutch-site.xml


Exception in thread "main" java.net.ConnectException: Connection timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:305)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:171)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:158)
        at java.net.Socket.connect(Socket.java:452)
        at java.net.Socket.connect(Socket.java:402)
        at java.net.Socket.<init>(Socket.java:309)
        at java.net.Socket.<init>(Socket.java:153)
        at org.apache.nutch.ipc.Client$Connection.<init>(Client.java:110)
        at org.apache.nutch.ipc.Client.getConnection(Client.java:343)
        at org.apache.nutch.ipc.Client.call(Client.java:281)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy0.getFilesystemName(Unknown Source)
        at org.apache.nutch.mapred.JobClient.getFs(JobClient.java:209)
        at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:249)
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:101)


--
Rod Taylor <[hidden email]>