Morning all,

I am following a book called Network Security Assessment and I am stuck on a particular section.

The author mentions using wget for crawling and scraping a website. This sounds like it could be quite useful; however, the command provided in the book does not work as expected.

The author says to use the following command to crawl and scrape the entire contents of a website.

wget -r -m -nv http://www.example.org

Then use the tree command, which should show all the pages within the website.

The above wget command only downloads the index.html file; it does not download all of the files.

I have tried the wget man pages but have had no luck finding a solution.

Has anyone seen the above before?

Many thanks


It could be that the website is restricting your activity. There are examples of how to get that command working here: Download a whole website with wget (or other) including all its downloadable content - Ask Ubuntu

You can also remove the -nv so that you can see what is happening. To get the full range of options, type “wget --help”.
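For example, something along these lines (just a sketch, with example.org standing in for the real site and the flags taken from that Ask Ubuntu answer; whether ignoring robots.txt is appropriate depends on the scope of your assessment):

wget -r -l inf -p -k -e robots=off --user-agent="Mozilla/5.0" http://www.example.org

-e robots=off tells wget to ignore the site's robots.txt, and --user-agent makes the requests look like a normal browser, which can help if the server refuses wget's default user agent.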


-r should be -R if you want recursive search


If you’re testing with example.org, I think the only thing you should receive is the index.

Have you tried it elsewhere?

Thanks, I will give it a go with -R.

Yes, I have tried a few other websites also.

I’m not an expert with “wget”, and I’m hesitant to disagree with jessevas. :wink:

But the help for wget version 1.20.3 (ht peterw2300) shows lower case “-r” for recursive:

wget -V
GNU Wget 1.20.3 built on linux-gnu.

Recursive download:
  -r,  --recursive                 specify recursive download
  -l,  --level=NUMBER              maximum recursion depth (inf or 0 for infinite)
       --delete-after              delete files locally after downloading them
  -k,  --convert-links             make links in downloaded HTML or CSS point to
                                     local files
       --convert-file-only         convert the file part of the URLs only (usually known as the basename)
       --backups=N                 before writing file X, rotate up to N backup files
  -K,  --backup-converted          before converting file X, back up as X.orig
  -m,  --mirror                    shortcut for -N -r -l inf --no-remove-listing
  -p,  --page-requisites           get all images, etc. needed to display HTML page
       --strict-comments           turn on strict (SGML) handling of HTML comments

Recursive accept/reject:
  -A,  --accept=LIST               comma-separated list of accepted extensions
  -R,  --reject=LIST               comma-separated list of rejected extensions
       --accept-regex=REGEX        regex matching accepted URLs
       --reject-regex=REGEX        regex matching rejected URLs
       --regex-type=TYPE           regex type (posix|pcre)
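
So a purely recursive grab with the lower-case options would look something like this (a sketch only, with example.org standing in for the real target):

wget -r -l inf -p -k -nv http://www.example.org

-l inf lifts the default recursion depth of 5, -p pulls in the images and CSS each page needs, and -k rewrites the links so the local copy browses properly.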


Also, your command is asking for both recursion and mirroring. Recursion is a subset of mirroring. I wonder if they’re stepping on each other when used at the same time. Try one or the other to see if you get better results.
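
Something like this, as a rough sketch (example.org is just a stand-in):

wget -r -nv http://www.example.org
wget -m -nv http://www.example.org

Going by the help output above, -m is a shortcut for -N -r -l inf --no-remove-listing, so it already implies -r, but running them separately is a quick way to rule out any interaction.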