The NCDC Weather Data for Hadoop the Definitive Guide

The NCDC Weather Data for Hadoop the Definitive Guide

Bing Li
Dear all,

I am following the book Hadoop: The Definitive Guide. However, I got stuck
because I could not get the NCDC weather data that is used by the source
code in the book. Appendix C told me I could follow some instructions
at www.hadoopbook.com, but I could not find the instructions there. Could you
give me a hand?

Thanks so much!

Best regards,
Bing

Re: The NCDC Weather Data for Hadoop the Definitive Guide

Andy Doddington
According to page 15 of the book, this data is available from the US National Climatic Data Center, at
http://www.ncdc.noaa.gov. Once you get to this site, there is a menu of links on the left-hand side of the
page, listed under the heading ‘Data & Products’. I suspect that the entry labelled ‘Free Data’ is the most
likely area you need to investigate :-)

Good Luck

Andy D

(In reply to Bing Li's message of 12 Feb 2012, 07:14.)

Re: The NCDC Weather Data for Hadoop the Definitive Guide

Bing Li
Andy,

Since there is a lot of data in the free data section of the site, I cannot
figure out which data set is the one discussed in the book. Any format
difference might cause the source code to throw exceptions. Some of the data
is even in PDF format!

Thanks so much!
Bing

(In reply to Andy Doddington's message of Sun, Feb 12, 2012, 4:35 PM.)

Re: The NCDC Weather Data for Hadoop the Definitive Guide

Andy Doddington
OK, well for starters, I think you can safely ignore the PDF data; to paraphrase Star Wars: “that isn’t the data
in which you are interested”.

Page 16 of the book describes the data format and refers to a data store that contains directories for each year from
1901 to 2001. It also shows the naming of .gz files within a sample directory (1990). The files in this directory have
names "010010-99999-1990.gz", "010014-99999-1990.gz", "010015-99999-1990.gz", and so on…

Going back to the NCDC web site (http://www.ncdc.noaa.gov), clicking on the ‘Free Data’ link on the left-hand
side of the screen brings up a new screen. Clicking again on the ‘Free Data’ link in the middle section of that
page brings up another page, listing the available data sets.


As this page notes, although some of this data needs to be paid for, there is at least one ‘free’ option within
each section. For simplicity, I went for the first one - the one labelled “3505 FTP data access” - which the comment
says is free. I used anonymous FTP and found that this site contained directories for each year from 1901 to 2012.
I expect the additional directories reflect the fact that time has moved on since the book was written :-)

There are also several text or PDF files that provide further information on the contents of the site. I suggest you
read some of these to get more details. One of these is called "ish-format-document.pdf" and it seems to describe
the data format in some detail. If you open this, you can check whether it matches the format expected by
the Hadoop sample code. There is also a ‘software’ directory, which contains various bits of code that might
prove useful.

On drilling down into the directory for 1990, I get a list of files that looks close enough to the file names in
the Hadoop book - I’d guess that these are the correct files.
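
The name comparison can be made mechanical. The sketch below assumes the names follow the shape of the three samples from the book (six digits, five digits, a four-digit year, then .gz); that pattern is inferred from those samples only, and the function names `is_ncdc_name` and `check_year_dir` are made up for illustration.

```shell
# Hypothetical helper: succeeds if a file name looks like 010010-99999-1990.gz.
# The digit counts are an assumption based on the sample names in the book.
is_ncdc_name() {
  printf '%s\n' "$(basename "$1")" |
    grep -Eq '^[0-9]{6}-[0-9]{5}-[0-9]{4}\.gz$'
}

# Report every file in a year directory whose name does not fit the pattern.
# Returns non-zero if at least one unexpected name was found.
check_year_dir() {
  bad=0
  for f in "$1"/*; do
    [ -e "$f" ] || continue
    if ! is_ncdc_name "$f"; then
      echo "unexpected name: $f"
      bad=1
    fi
  done
  return "$bad"
}
```

Running `check_year_dir 1990` against a downloaded year directory would then list only the names that deviate from the expected shape.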

Given the passage of time, it is still possible that the file format has changed in a way that makes it
incompatible with the Hadoop code. However, it shouldn’t be that difficult to modify the code to suit the new
format (which is very well documented, as already noted).
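
A quick way to test compatibility is to print the fields the sample code reads from a few records. The sketch below is hedged: the offsets (year at characters 16-19, signed air temperature at 88-92, quality code at 93, all 1-based) are an assumption taken from the book's sample code rather than from this thread, so they should be confirmed against ish-format-document.pdf. The name `show_fields` is illustrative.

```shell
# Print year, air temperature and quality code for each record on stdin.
# Offsets are 1-based and ASSUMED from the book's sample code; verify them
# against ish-format-document.pdf before relying on them.
show_fields() {
  awk '{ print substr($0, 16, 4), substr($0, 88, 5), substr($0, 93, 1) }'
}

# Example (file name taken from the book's listing):
#   gunzip -c 010010-99999-1990.gz | head -n 3 | show_fields
```

If the first column is not a plausible year, or the second is not a signed number, the format has probably drifted from what the code expects.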

Good luck!

Andy

(In reply to Bing Li's message of 12 Feb 2012, 08:50.)

Re: The NCDC Weather Data for Hadoop the Definitive Guide

Sujit Dhamale
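Hi,
If needed, you can run the script below to store the data on your local system:

# Collect every year's .gz files in /home/ubuntu/work/files
mkdir -p /home/ubuntu/work/files
cd /home/ubuntu/work/
for i in {1901..2012}
do
# -nH drops the host directory and --cut-dirs=3 drops pub/data/noaa/,
# so each year's files are saved under ./$i/
wget -r -np -nH --cut-dirs=3 -R "index.html*" "http://ftp3.ncdc.noaa.gov/pub/data/noaa/$i/"
cp "$i"/*.gz /home/ubuntu/work/files/
rm -r "$i"
done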



(In reply to Andy Doddington's message of Mon, Feb 13, 2012, 3:43 PM.)

Re: The NCDC Weather Data for Hadoop the Definitive Guide

Sujit Dhamale
To avoid creating the folders repeatedly, follow the steps below:


1. Create a folder on your local drive.
  I created "/home/sujit/Desktop/Data/".

2. Create the script below and run it:

cd /home/sujit/Desktop/Data/
for i in {1901..2012}
do
# Without -nH, wget saves each year under ftp3.ncdc.noaa.gov/pub/data/noaa/$i/
wget -r --no-parent --reject "index.html*" "http://ftp3.ncdc.noaa.gov/pub/data/noaa/$i/"
done
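
Once the per-station files are on disk, a possible next step is to merge each year's files into a single archive, since the book describes its copy of the dataset as preprocessed so that each year's readings sit in one file. The sketch below is one way to do that, not the book's own procedure; the function name `merge_year` and the layout `<src>/<year>/*.gz` are assumptions, so point it at wherever wget actually placed the files.

```shell
# merge_year SRC_DIR YEAR OUT_DIR
# Decompress every station file under SRC_DIR/YEAR and recompress the
# combined stream into OUT_DIR/YEAR.gz. Names and layout are hypothetical.
merge_year() {
  src="$1"; year="$2"; out="$3"
  mkdir -p "$out" || return 1
  for f in "$src/$year"/*.gz; do
    # An unmatched glob stays literal, so skip names that do not exist.
    [ -e "$f" ] || continue
    gunzip -c "$f"
  done | gzip > "$out/$year.gz"
}

# Example, using the download folder from step 1:
#   merge_year /home/sujit/Desktop/Data/ftp3.ncdc.noaa.gov/pub/data/noaa 1990 /home/sujit/Desktop/Data/merged
```

Looping over the files one at a time avoids passing thousands of names to a single command. If a year directory is empty, the result is an empty archive.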





(In reply to Sujit Dhamale's message of Fri, Nov 16, 2012, 1:01 PM.)