Grab entire website content with “wget”
The “wget” (web get) utility lets you grab the entire content of a website. It comes in handy when you have difficulty getting the source code from your software vendor, or when you do not have access details for a certain location. “wget” is available for the win32 platform as well; one workaround is to install Cygwin and run the “wget” command from there.
$ cd /tmp
$ mkdir sitename.com
$ cd sitename.com
$ wget -r -H -k -Dsitename.com,www.sitename.com http://www.sitename.com/
Switches used …
-r switch for recursively handling files
-H switch to “span hosts”, as some sites link to pages on other domains
-D switch to list the domains from which files will be gathered, separated by commas with no spaces. It only restricts host spanning, so use it together with the -H switch
-k switch to convert the links in the downloaded pages so that they refer to the local copies and not to the original internet location
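The switches above can be wrapped in a small script. This is only a minimal sketch: `mirror_site` and `run` are hypothetical helper names, not part of wget, and sitename.com is a placeholder. Setting DRY_RUN=1 prints the command instead of running it, which is a handy way to check the comma-separated -D list before downloading anything.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# mirror_site is a hypothetical helper, not part of wget.
# Usage: mirror_site START_URL DOMAIN [DOMAIN...]
mirror_site() {
    if [ "$#" -lt 2 ]; then
        echo "usage: mirror_site START_URL DOMAIN [DOMAIN...]" >&2
        return 2
    fi
    start_url=$1
    shift
    # -D expects one comma-separated list with no spaces between entries.
    domains=$(printf '%s,' "$@")
    domains=${domains%,}
    # -r recurse, -H span hosts, -k convert links, -D limit spanning to these domains.
    run wget -r -H -k "-D$domains" "$start_url"
}
```

For example, after setting DRY_RUN=1, calling `mirror_site http://www.sitename.com/ sitename.com www.sitename.com` prints the wget command line it would have run.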
To get a mirror copy
$ wget -r http://somesite.com
Use it with --convert-links to make offline copies with local links
$ wget --convert-links -r http://somesite.com
To save the files with a .html extension (newer wget releases call this option --adjust-extension)
$ wget --html-extension -r http://somesite.com
Now you will have all the files downloaded under /tmp/sitename.com.
Where to get WGET?
Some sites may block the user agent used by Wget to keep it from consuming too much of their bandwidth. In that case you could in turn present the user agent string of a crawler from Yahoo, Google, or MSN, using wget's --user-agent option, to grab a local copy.
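Changing the user agent could look something like the sketch below. The helper names `fetch_as` and `run` are hypothetical, and the agent string is whatever you choose to pass in. The --wait=1 pause between requests is my own addition, meant to keep the load on the server low. With DRY_RUN=1 the command is printed rather than executed.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# fetch_as is a hypothetical helper, not part of wget.
# Usage: fetch_as AGENT_STRING URL
fetch_as() {
    agent=$1
    url=$2
    # --user-agent replaces wget's default "Wget/<version>" identification;
    # --wait=1 pauses one second between requests.
    run wget --user-agent="$agent" --wait=1 -r "$url"
}
```

Note that the dry-run output drops the shell quoting, so an agent string containing spaces will appear unquoted in the printed command even though it is passed to wget as a single argument.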
“wget” accepts many more options for grabbing site content, including settings for HTTP keep-alive (persistent connections), cookies, POST data, HTTPS, and FTP. Check the manual for more detailed help on this topic.
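As one illustration of the cookie and POST data options, here is a rough sketch of logging in to a site first and then reusing the session cookie for the recursive download. Everything specific in it is an assumption: `login_and_mirror` and `run` are hypothetical helpers, and the login URL, form field names, and cookie file name are placeholders that will differ for every site.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# login_and_mirror is a hypothetical helper, not part of wget.
# Usage: login_and_mirror LOGIN_URL FORM_DATA SITE_URL [COOKIE_FILE]
login_and_mirror() {
    login_url=$1
    form_data=$2
    site_url=$3
    cookie_jar=${4:-cookies.txt}

    # Step 1: submit the login form with POST, keep the session cookie,
    # and discard the downloaded login response.
    run wget --save-cookies "$cookie_jar" --keep-session-cookies \
        --post-data "$form_data" --delete-after "$login_url" || return 1

    # Step 2: reuse the saved cookies for the recursive download.
    run wget --load-cookies "$cookie_jar" -r "$site_url"
}
```

A call might look like `login_and_mirror http://somesite.com/login 'user=foo&password=bar' http://somesite.com/`, with the form data adjusted to match the field names the site's login form actually uses.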
An alternative to “wget” is a tool called “HTTrack”, which is also free and has a version for Windows users. Check http://www.httrack.com/ to download a copy of this tool.