Hi all,

I have a weird problem that is pretty much entirely outside my job description, but it falls to me to get it done, and I am completely out of my depth.

My company has an old website that we have lost all the details for, but we urgently need to back up the content from it and put it on a new host, as we are moving all our hosting.

I am looking for a way to crawl the entire website and export a CSV file of all the URLs it contains. I need it this way for our records, so I don’t really need an offline copy of the site, just all the URLs contained within it.

I have tried a few website scrapers to no avail, and I have played around a bit with wget for Windows, which looks like it should do what I need, but I am not having any luck. There is a lot of advice on wget and lynx online, but so far neither quite does what I need.

Any help, as always, would be very gratefully received.

Thanks


Presumably it’s hosted somewhere, so it’s just a matter of contacting the host and explaining the situation?


You’d think so. Unfortunately, it is tied to an old IT admin we had here, and they only recognise his name. It’s one that slipped through the net long before I started here, and it came over as part of an acquisition of another company, so they don’t even recognise our company. This seemed the only option, but it is now also proving very tricky!

I’m a bit confused. Are you saying you just need a copy of the URLs that the site contains?

My plan would be:

  1. Use wget recursively to pull down all static content
  2. Do Clever Things (depending on what you know) to search through all the files for <a href=" and save the resulting links (rough sketch after this list)
  3. Beer
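For steps 1 and 2, something along these lines would be my starting point (an untested sketch; example.com is a placeholder for the real site, and the grep/sed part assumes a Unix-ish shell such as WSL or Git Bash if you’re stuck on Windows):

# Step 1: recursively pull down the site's static content (wget saves it under a folder named after the host)
wget -r -np http://example.com/

# Step 2: pull every href out of the downloaded HTML and save the unique links, one per line
grep -rhoE 'href="[^"]*"' example.com/ | sed 's/^href="//; s/"$//' | sort -u > links.csv

The “CSV” is really just a single column of URLs, but it opens fine in Excel for tidying up.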

Beer sounds good; unfortunately, the steps before it are where I’m having a problem 🙂

I’m a complete newbie with wget, but I believed there was a way to scan the website and automatically output the list of URLs to a CSV. Or am I trying to be too clever, or not clever enough (which is much more likely)?

Wget is a good idea; have a read of this → Downloading an Entire Web Site with wget | Linux Journal
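From memory, the full-site command that kind of guide walks through looks roughly like this (website.org is a placeholder, so swap in the real domain, and check the article for the exact flags):

# recurse the whole site, keep filenames Windows-safe, and stay on the one domain
wget --recursive --no-clobber --page-requisites --html-extension \
     --convert-links --restrict-file-names=windows \
     --domains website.org --no-parent http://www.website.org/

That grabs a full offline copy; since you only want the URLs, you can drop --convert-links and --page-requisites and just grep the downloaded files afterwards.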


I’m still confused as to what your final aim is here.

What are you expecting the CSV file to contain? A list of pages on the website? A list of links on the website?

A list of links on the website, ideally. I know it will be quite a mess, and I will need to tidy that up, but it’s a start.

This is what I’ve used many times and it’s worked great…

What wget command line are you using? Have you tried the spider option?

wget --spider --force-html -r -l1 http://somesite.com

http://stackoverflow.com/questions/2804467/spider-a-website-and-return-urls-only
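To get from the spider run to the CSV you’re after, the usual trick in that Stack Overflow thread is to write the crawl log to a file and then pull the URLs back out of it, along these lines (untested here, and again assuming a Unix-ish shell; raise -l1 if you want it to crawl deeper than one level):

# crawl without downloading anything, writing the crawl log to spider.log
wget --spider --force-html -r -l1 -o spider.log http://somesite.com
# each fetched URL appears in the log on a line starting with "--"; the URL is the third field
grep '^--' spider.log | awk '{ print $3 }' | sort -u > urls.csv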



I haven’t, no. I’ll give that a go, thanks.

I take it you don’t have access via FTP? 😕

wget -r --no-parent http://site.com/

^ You can run wget from Windows too if need be, but wget and HTTrack are really your two options without any access or creds for the directories.
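If you end up trying HTTrack instead, its command line is about as simple (a minimal sketch, with ./mirror as a made-up output folder; you can then grep the saved HTML for links the same way as with wget):

# mirror the site into ./mirror using HTTrack's command-line mode
httrack "http://site.com/" -O "./mirror"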