Hi all,

I have a weird problem that is pretty much entirely outside my job description, but it falls to me to get it done, and I am completely out of my depth.

My company has an old website that we have lost all the details for, but we urgently need to back up the content from it and put it on a new host, as we are moving all our hosting.

I am looking for a way to crawl the entire website and export a CSV file of all the URLs it contains. I need it this way for our records, so I don’t really need an offline copy of the site, just all the URLs contained within it.

I have tried a few website scrapers to no avail, and I have played around a bit with wget for Windows, which looks like it should do what I need, but I am not having any luck. There is a lot of advice on wget and lynx online, but so far neither quite does what I need.

Any help, as always, would be very gratefully received.

Thanks


Presumably it’s hosted somewhere, so it’s just a matter of contacting the host and explaining the situation?


You’d think so. Unfortunately, it is tied to an old IT admin we had here, and they only recognise his name. It’s one that slipped through the net long before I started here, and it came over as part of an acquisition of another company, so they don’t even recognise our company. This seemed the only option, but it is now also proving very tricky!

I’m a bit confused. Are you saying you just need a copy of the URLs that the site contains?

My plan would be:

  1. Use wget recursively to pull down all static content
  2. Do Clever Things (depending on what you know) to search through all the files for <a href=" and save the resulting links (rough sketch after this list)
  3. Beer
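For steps 1 and 2, something along these lines would be my starting point (an untested sketch; example.com is a placeholder for the real site, and the grep/sed part assumes a Unix-ish shell such as WSL or Git Bash if you’re stuck on Windows):

# Step 1: recursively pull down the site's static content (wget saves it under a folder named after the host)
wget -r -np http://example.com/

# Step 2: pull every href out of the downloaded HTML and save the unique links, one per line
grep -rhoE 'href="[^"]*"' example.com/ | sed 's/^href="//; s/"$//' | sort -u > links.csv

The “CSV” is really just a single column of URLs, but it opens fine in Excel for tidying up.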

Beer sounds good; unfortunately, the steps before it are where I’m having a problem 🙂

I’m a complete newbie with wget, but I believed there was a way to scan the website and automatically output the list of URLs to a CSV. Or am I trying to be too clever, or not clever enough (which is much more likely)?

Wget is a good idea; have a read of this → Downloading an Entire Web Site with wget | Linux Journal
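From memory, the full-site command that kind of guide walks through looks roughly like this (website.org is a placeholder, so swap in the real domain, and check the article for the exact flags):

# recurse the whole site, keep filenames Windows-safe, and stay on the one domain
wget --recursive --no-clobber --page-requisites --html-extension \
     --convert-links --restrict-file-names=windows \
     --domains website.org --no-parent http://www.website.org/

That grabs a full offline copy; since you only want the URLs, you can drop --convert-links and --page-requisites and just grep the downloaded files afterwards.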


I’m still confused as to what your final aim is here.

What are you expecting the CSV file to contain? A list of pages on the website? A list of links on the website?

A list of links on the website, ideally. I know it will be quite a mess, and I will need to tidy that up, but it’s a start.

This is what I’ve used many times and it’s worked great…

What wget command line are you using? Have you tried the spider option?

wget --spider --force-html -r -l1 http://somesite.com

http://stackoverflow.com/questions/2804467/spider-a-website-and-return-urls-only
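To get from the spider run to the CSV you’re after, the usual trick in that Stack Overflow thread is to write the crawl log to a file and then pull the URLs back out of it, along these lines (untested here, and again assuming a Unix-ish shell; raise -l1 if you want it to crawl deeper than one level):

# crawl without downloading anything, writing the crawl log to spider.log
wget --spider --force-html -r -l1 -o spider.log http://somesite.com
# each fetched URL appears in the log on a line starting with "--"; the URL is the third field
grep '^--' spider.log | awk '{ print $3 }' | sort -u > urls.csv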



I haven’t, no. I’ll give that a go, thanks.

I take it you don’t have access via FTP? 😕

wget -r --no-parent http://site.com/

^ You can run wget from Windows too if need be, but wget and HTTrack are really your two options without any access or creds for the directories.
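If you end up trying HTTrack instead, its command line is about as simple (a minimal sketch, with ./mirror as a made-up output folder; you can then grep the saved HTML for links the same way as with wget):

# mirror the site into ./mirror using HTTrack's command-line mode
httrack "http://site.com/" -O "./mirror"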