Description
SiteSucker is a Macintosh application that automatically downloads websites from the Internet. It does this by asynchronously copying the site's webpages, images, PDFs, style sheets, and other files to your local hard drive, duplicating the site's directory structure. Just enter a URL (Uniform Resource Locator), press return, and SiteSucker can download an entire website.
SiteSucker can be used to make local copies of websites. By default, SiteSucker 'localizes' the files it downloads, allowing you to browse a site offline, but it can also download sites without modification.
You can save all the information about a download in a document. This allows you to create a document that you can use to perform the same download whenever you want. If SiteSucker is in the middle of a download when you choose the Save command, SiteSucker will pause the download and save its status with the document. When you open the document later, you can restart the download from where it left off by pressing the Resume button.
Requirements
The current version of SiteSucker is a universal app built to run on Macintosh computers with Intel or Apple silicon processors. It requires macOS 11 Big Sur or greater. Of course, to download files, your computer will also need an Internet connection.
Available Languages
Users from around the world have translated SiteSucker from English into other languages. Currently, SiteSucker can be viewed in the following languages:
- English
- French — Translation by Jean-Pierre Kuypers
- German — Translation by Christoph Schmitz
- Italian — Translation by Massimo Ruffinengo
- Portuguese — Translation by Paulo Neto
- Spanish — Translation by Borja Santos-Diez Vázquez
Getting SiteSucker
Click on the image below to get the latest version of SiteSucker from the Mac App Store.
The current version of SiteSucker is 4.0.5.
For earlier operating systems, the following versions of SiteSucker are available:
- For macOS 10.9 Mavericks or greater: SiteSucker 2.4.6
- For macOS 10.6 Snow Leopard, 10.7 Lion, or 10.8 Mountain Lion: SiteSucker 2.3.6
- For macOS 10.5 Leopard: SiteSucker 2.3.3
- For macOS 10.4 Tiger: SiteSucker 2.2.4
- For releases prior to macOS 10.4 Tiger: SiteSucker 1.6.9
All versions of SiteSucker prior to version 2.5 are available from the Version History page.
SiteSucker Pro
SiteSucker Pro is an enhanced version of SiteSucker that can download embedded videos, including embedded YouTube and Vimeo videos. You can try SiteSucker Pro for up to 14 days before you buy it. During that period, the application is fully functional except that you can download no more than 100 files at a time. You can purchase SiteSucker Pro from the Registration dialog within the app. The End User License Agreement specifies the rights and restrictions which apply to the use of SiteSucker Pro.
The current version of SiteSucker Pro is 4.0.5.
For earlier operating systems, the following version of SiteSucker Pro is available:
- For macOS 10.14 Mojave or greater: SiteSucker Pro 3.2.7
To download a disk image containing the latest version of SiteSucker Pro, click on the button below.
Support
SiteSucker help references online manuals that explain all of its features. You can access the manual for the current version of SiteSucker by clicking on one of the links below:
- English: SiteSucker Manual for macOS
- French: Manuel SiteSucker pour macOS
- Portuguese: Manual do SiteSucker para macOS
Email support is provided by the author: Rick Cranisky <ss-osx-support@ricks-apps.com>.
Send in your feature requests, bug reports, user interface gripes, or anything else you have to say about SiteSucker. If you are having problems downloading a site, please provide the site's URL in your email message and some indication of your SiteSucker settings.
First written: 2016-2017. Last nontrivial update: 2019 Jan 13.
Summary
One way to back up a website—whether your own or someone else's—is to use a tool that downloads the website. Then you can back up the resulting files to the cloud, optical media, etc. This page gives some information on downloading websites using tools like HTTrack and SiteSucker.
Note: Here's a list of the domains I have downloaded. Let me know if you don't want your site to be downloaded.
Contents
- HTTrack
- Compress archived websites?
HTTrack
On Windows, HTTrack is commonly used to download websites, and it's free. Once you download a site, you can zip its folder and then back that up the way you would any of your other files.
I'm still a novice at HTTrack, but from my experience so far, I've found that it captures only ~90% of a website's individual pages on average. For some websites (like the one you're reading now), HTTrack seems to capture everything, but for other sites, it misses some pages. Maybe this is because of complications with redirects? I'm not sure. Still, ~90% backup is much better than 0%.
You can verify which pages got backed up by opening the domain's index.html file from HTTrack's download folder and browsing around using the files on your hard drive. It's best if you disconnect from the Internet when doing this because I found that if I was online when browsing around the downloaded file contents, some pages got loaded from the Internet, not from the local files that I was testing.
Pictures don't seem to load offline, but you can check that they're still being downloaded. For example, for WordPress site downloads, look at the wp-content/uploads folder.
I won't explain the full how-to steps of using HTTrack, but below are two problems that I ran into.
Troubleshooting: gets too many pages
When I tried to use HTTrack to download a single website using the program's default settings (as of Nov. 2016), I downloaded the website but also got some other random files from other domains, presumably from links on the main domain. In some cases, the number of links that the program tried to download grew without limit, and I had to cancel. In order to download files only from the desired domain, I had to do the following.
Step 1: Specify the domain(s) to download (as I had already been doing).
Step 2: Add a Scan Rules pattern like this: +https://*animalcharityevaluators.org/* . This way, only links on that domain will be downloaded.
Including a * before the main domain name is useful in case the site has subdomains. For example, the site https://animalcharityevaluators.org/ has a subdomain http://researchfund.animalcharityevaluators.org/ , which would be missed if you only used the pattern +https://animalcharityevaluators.org/* .
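To see why the leading * matters, here's a minimal sketch that imitates the scan rule with Python's shell-style wildcards. It is not HTTrack's real matching engine: the helper names `strip_scheme` and `allowed` are made up for illustration, and ignoring the http/https prefix is a simplifying assumption of mine.

```python
from fnmatch import fnmatchcase


def strip_scheme(text):
    """Drop a leading 'http://' or 'https://' (assumption: the scheme is ignored)."""
    return text.split("://", 1)[-1]


def allowed(url, rule):
    """Approximate a '+pattern' scan rule with shell-style wildcard matching."""
    pattern = strip_scheme(rule.lstrip("+"))
    return fnmatchcase(strip_scheme(url), pattern)


wide_rule = "+https://*animalcharityevaluators.org/*"
narrow_rule = "+https://animalcharityevaluators.org/*"
subdomain = "http://researchfund.animalcharityevaluators.org/"

wide_hits_subdomain = allowed(subdomain, wide_rule)
narrow_hits_subdomain = allowed(subdomain, narrow_rule)
wide_hits_main = allowed("https://animalcharityevaluators.org/", wide_rule)
wide_hits_other = allowed("https://example.com/page", wide_rule)

print(wide_hits_subdomain, narrow_hits_subdomain, wide_hits_main, wide_hits_other)
```

In this approximation the pattern with the leading * accepts both the main domain and the subdomain, while the pattern without it accepts only the main domain.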
Troubleshooting: Error: 'Forbidden' (403)
Some pages gave me a 'Forbidden' error, which prevented any content from being downloaded. I was able to fix this by clicking on 'Set options..', choosing the 'Browser ID' tab, and then changing the 'Browser Identity' setting from the default of 'Mozilla/4.5 (compatible: HTTrack 3.0x; Windows 98)' to 'Java1.1.4'. I chose the Java identity because it didn't contain the substring 'HTTrack', which may have been the reason I was being blocked.
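The browser identity is sent to the server as the User-Agent request header. As a rough sketch of the same idea outside HTTrack, the snippet below builds a request with a chosen identity using Python's standard library. The function names and the example.com URL are placeholders of mine, and `status_for` needs a network connection, so the demo only builds the request without sending it.

```python
import urllib.error
import urllib.request

# The two identity strings quoted above.
HTTRACK_DEFAULT_UA = "Mozilla/4.5 (compatible: HTTrack 3.0x; Windows 98)"
REPLACEMENT_UA = "Java1.1.4"


def build_request(url, user_agent):
    """Create a request that announces itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_for(url, user_agent):
    """Send the request and return the HTTP status code (requires network)."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 403 when the server refuses this identity


# Placeholder URL; nothing is sent here.
request = build_request("https://example.com/", REPLACEMENT_UA)
sent_identity = request.get_header("User-agent")
print(sent_identity)
```

Calling `status_for` on a blocked page with each of the two identities would be one way to check whether the 'HTTrack' substring is really what triggers the 403.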
SiteSucker
On Mac, I download websites using SiteSucker. This page gives configuration details that I use when downloading certain sites.
Including redirects
I think website downloads using the above methods don't include the redirects that a site may be using. A redirect ensures that an old link doesn't break when you move a page to a new url. If you back up your website, it's nice to include the redirects in the backup, in case you need to regenerate your website in the future.
I'm not sure if there's a way to download the redirects of a site you don't own; let me know if there is. For a site you do own, sometimes you can back up the redirects by saving the relevant .htaccess file. In my case, I use the 'Redirection' plugin in WordPress, and its menu has an 'Import/Export' option; I find that the 'Nginx rewrite rules' export format is concise and readable.
Non-linked content
HTTrack and SiteSucker are web crawlers, which means they identify pages on your site by following links. If you have content on your site that's not linked from the starting page you provide, then I assume these programs won't download it. (I've verified that this is true at least for SiteSucker.) If you want a page or file on your website to be downloaded, make sure there's at least one link to it. If you don't want the link to your content to be noticeable, you can add a hyperlink with no anchor text, like this: <a href='my_url.pdf'></a>
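Here's a minimal sketch of the link-following step, to show why an empty hyperlink is enough. It is a toy link extractor built on Python's standard HTML parser, not how HTTrack or SiteSucker are actually implemented, and the class and function names are my own.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the targets of <a href=...> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


page = "<p>See <a href='about.html'>About</a>.</p><a href='my_url.pdf'></a>"
found = extract_links(page, "http://yoursitehere.com/")
print(found)
```

The extractor only looks at the href attribute, so the anchor with no text is discovered just like the visible one. A file that no page points to never enters the list at all.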
I use this trick for files that I store on my sites as backups. In particular, whenever I publish a substantive article on a website that I don't control, such as an interview published on someone else's site, I create a PDF backup of the page because I can't guarantee that the other person will keep the content online indefinitely. On my page I add a visible hyperlink that points to the other person's site, but I also upload the PDF backup to my own site in case the original content ever disappears. Since I want these PDF files to be included in backups of my website content, I create hyperlinks to these PDF files with no anchor text.
[Update in 2018: For content of mine that's published on other people's sites, I've decided to stop storing PDF backups on my website, since these backup files could theoretically still show up in Google results. Plus, if someone else takes his copy of the content down, there's a chance he did so deliberately, and I'd want to check with him before having a copy available on my site. My new approach is to back up my interviews and other content that's hosted on another person's site just to my private files—both a print-to-PDF copy and the raw HTML. If the content on the other person's site ever goes away, I can ask that person for permission to upload it to my own site.]
Images that are only used in the context of meta property='og:image' (for Facebook image previews) and that aren't actually linked from the body of your HTML also won't be captured by crawlers. Again, you can add an a href link to the image with no anchor text to make sure the image gets downloaded.
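One way to find such images is to scan a page for og:image values that never appear as a link or image in the HTML. The sketch below does that for a single HTML string. It assumes a crawler picks up a href and img src targets, and it compares URLs as plain strings without normalizing them; the names are hypothetical.

```python
from html.parser import HTMLParser


class ImageRefCollector(HTMLParser):
    """Separate og:image URLs from URLs a crawler is assumed to follow."""

    def __init__(self):
        super().__init__()
        self.og_images = set()
        self.crawlable = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("property") == "og:image" and attr.get("content"):
            self.og_images.add(attr["content"])
        elif tag == "a" and attr.get("href"):
            self.crawlable.add(attr["href"])
        elif tag == "img" and attr.get("src"):
            self.crawlable.add(attr["src"])


def unlinked_og_images(html):
    """Return og:image URLs that are not also referenced by a link or image tag."""
    collector = ImageRefCollector()
    collector.feed(html)
    collector.close()
    return sorted(collector.og_images - collector.crawlable)


head = "<head><meta property='og:image' content='/img/preview.png'></head>"
body = "<body><img src='/img/body.png'></body>"
before_fix = unlinked_og_images(head + body)
after_fix = unlinked_og_images(head + body + "<a href='/img/preview.png'></a>")
print(before_fix, after_fix)
```

Before the empty link is added, the preview image shows up as unlinked; afterwards the list is empty.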
Images not on your site
If your site has images that aren't hosted on your own domain, then crawlers won't download those images when only downloading same-domain content. For example, if you use the WordPress Jetpack plugin with the Photon module, then an image that would normally be hosted at http://yoursitehere.com/wp-content/uploads/2014/04/myimage.jpg will instead be hosted at something like https://i0.wp.com/yoursitehere.com/wp-content/uploads/2014/04/myimage.jpg?w=642. As a result, a crawler won't download this image.
At least in SiteSucker, I think you can work around this problem by downloading http://yoursitehere.com/wp-content/uploads/ in addition to http://yoursitehere.com. The download of http://yoursitehere.com/wp-content/uploads/ seems to pick up the images for some reason. (In fact, it picks up multiple sizes of each image.)
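If you'd rather list the original image URLs explicitly, the Photon-style address in the example above can be mapped back to the same-domain address. This is only a sketch based on that one example: the function name is made up, the host list contains only the i0.wp.com host seen above (other hosts would have to be added), and the http scheme is an assumption taken from the example.

```python
from urllib.parse import urlsplit

# Only the host from the example above; extend this set if your pages use others.
PHOTON_HOSTS = {"i0.wp.com"}


def photon_to_origin(url, scheme="http"):
    """Map a Photon-style image URL back to the URL on the original domain.

    URLs on other hosts are returned unchanged. The query string (such as
    ?w=642) is dropped because it only selects a resized variant.
    """
    parts = urlsplit(url)
    if parts.netloc not in PHOTON_HOSTS:
        return url
    return f"{scheme}://{parts.path.lstrip('/')}"


origin = photon_to_origin(
    "https://i0.wp.com/yoursitehere.com/wp-content/uploads/2014/04/myimage.jpg?w=642"
)
print(origin)
```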
Saving PDFs of JavaScript calculations
A few of the pages on my websites contain JavaScript calculators, which produce output numbers, text, and graphs computed from inputs. For my calculators, the JavaScript is contained within the main HTML file, so backing up the HTML backs up the JavaScript. However, I think it's also important to save PDF backups of these pages that show the calculated results on the default input values, because JavaScript seems more brittle than plain HTML.
A regular HTML document is human-readable. Even if browsers 50 years into the future can't render present-day HTML files, a human with some knowledge of historical HTML tags could still understand 99%, if not 100%, of the HTML just by looking at it in a text editor. However, nontrivial JavaScript calculations are harder to understand just by looking at them. To get the results, you have to actually run the code, and it's not obvious to me that web browsers in, say, 20 years will be backward-compatible enough to run JavaScript that I might write today. Of course, I could probably update my JavaScript to accommodate future changes, but this requires constant vigilance, and there's a risk of introducing bugs along the way. Having a static snapshot of the results of the JavaScript calculations is useful in case the code breaks in the future and you don't have time to fix it. Plus, if you do update the code, once it's up and running again you can check the results of the calculations against the saved PDF files to ensure that you haven't inadvertently messed up the code while fixing it.
Compress archived websites?
Once you've downloaded a website using HTTrack or similar software, should you compress the website folder before backing it up to the cloud? I'm uncertain and would appreciate reader feedback, but here are some considerations.
My impression is that plain text files (such as raw HTML files) are more secure against format rot and bit rot, because 'They avoid some of the problems encountered with other file formats, such as endianness, padding bytes, or differences in the number of bytes in a machine word. Further, when data corruption occurs in a text file, it is often easier to recover and continue processing the remaining contents.' A Reddit comment says: 'Straight up txt files have a very low structural scope / over head, so unless you're doing something funky, a bit error is limited to a character byte.'
As a result, I plan to back up my own websites and other important sites mostly as uncompressed files (with some compressed copies thrown into the mix too). However, when backing up lots of other websites that are less essential, compression may make sense. This is especially so if the website download has a lot of redundancy. Following is an example.
Compression example with duplicate content
In 2017, I downloaded www.mattball.org using SiteSucker. The download had a huge amount of redundancy using the default SiteSucker download settings, because each blog comment on a blog post had its own url and thus downloaded the blog post again. For example, on a blog post with 7 comments, I got 8 copies of the blog HTML: 1 from the original post, and 1 from each of the 7 comment urls. The website download also included an enormous number of search pages. Probably I could prevent these copies from downloading with some jiggering of the settings, but I want to be able to download lots of sites with minimal per-site configuration, and I'm not sure that url-exclusion rules that I might apply in this case would work elsewhere.
In principle, compression can minimize the burden of duplicate content. Does it in practice? During the www.mattball.org download, I checked to see that the raw content downloaded so far occupied ~450 MB. Applying 'Normal' zip compression using Keka software gave a zip archive of 88 MB, which is about 1/5 the uncompressed size. Not bad. However, a 'Normal' 7z archive of the raw data was only 1.6 MB—a little more than 1/300th of the uncompressed size!
Using a simple test folder with two copies of a file, I verified that zip compression doesn't detect duplicate files, but 7z compression does. Presumably this explains the dramatic size reduction using 7z. This person found the same: 'You might expect that ZIP is smart enough to figure out this is repeating data and use only one compression object inside the .zip, but this is not the case[..] Basically most such utilities behave similarly (tar.gz, tar.bz2, rar in solid mode) - only 7zip caught me [..].'
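The two-copies test can be reproduced roughly in Python. The sketch below is not the Keka test itself: it uses the standard library's zip writer, and a tar.xz archive stands in for 7z on the assumption that both compress the files as one continuous LZMA-family stream. The "page" is random bytes, so the only possible saving comes from noticing the duplicate.

```python
import io
import os
import tarfile
import zipfile


def zip_size(files):
    """Size of a zip archive; each member is deflated on its own."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return len(buf.getvalue())


def tar_xz_size(files):
    """Size of a tar.xz archive; all members share one compressed stream."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:xz") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


# Incompressible stand-in for one downloaded page, stored twice.
page = os.urandom(200_000)
site = {"post.html": page, "post-comment-1.html": page}

zip_bytes = zip_size(site)
xz_bytes = tar_xz_size(site)
print(f"one copy: {len(page)} bytes, zip: {zip_bytes} bytes, tar.xz: {xz_bytes} bytes")
```

The zip archive comes out at about twice the size of one copy, while the tar.xz archive stays close to the size of a single copy, which is consistent with the zip versus 7z gap described above.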
Security concerns?
Is it dangerous to download websites because you might make a request to a dangerous url? I'm still exploring this topic and would like advice.
My tentative guess is that the risk is low if you only download web pages from a given (trustworthy) domain. If you also download pages on other domains that are linked from the first domain, perhaps there's more risk?
HTTrack's FAQ says: 'You may encounter websites which were corrupted by viruses, and downloading data on these websites might be dangerous if you execute downloaded executables, or if embedded pages contain infected material (as dangerous as if using a regular Browser). Always ensure that websites you are crawling are safe.'
This page says: 'SiteSucker totally ignores JavaScript. Any link specified within JavaScript will not be seen by SiteSucker and will not be downloaded.' Does this help with security? How much?
GBC (2013): 'Essentially all BROWSER vulnerabilities (ie. not vulns. in plugins like java or flash) involve and rely on JavaScript (JS) running.'
Using downloads for monitoring website changes
Suppose you want to monitor what changes are made to your website over time, such as to track what revisions your fellow authors are making to articles. While I imagine there are various ways to do this, one relatively low-tech method is as follows. Periodically (say, every few months, or at whatever frequency suits you) download a new copy of your website using HTTrack or SiteSucker. Store at least the previous download as well. Then run diff -r on your two website-download folders to see what has changed. You could make this more sophisticated by adding logic to ignore trivial changes or changes in files you don't care about.
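As a starting point for that extra logic, here's a minimal Python sketch that compares two download folders recursively and reports removed, added, and changed files. It's a stand-in for diff -r rather than a replacement (it lists which files differ, not the line-level differences), and the function name and demo file names are made up.

```python
import filecmp
import os
import tempfile


def diff_trees(old_dir, new_dir, rel=""):
    """Return (removed, added, changed) lists of paths relative to the roots."""
    cmp = filecmp.dircmp(os.path.join(old_dir, rel), os.path.join(new_dir, rel))
    removed = [os.path.join(rel, name) for name in cmp.left_only]
    added = [os.path.join(rel, name) for name in cmp.right_only]
    # shallow=False compares file contents instead of trusting size and mtime.
    _, mismatch, errors = filecmp.cmpfiles(
        cmp.left, cmp.right, cmp.common_files, shallow=False
    )
    changed = [os.path.join(rel, name) for name in mismatch + errors]
    for sub in cmp.common_dirs:
        sub_removed, sub_added, sub_changed = diff_trees(
            old_dir, new_dir, os.path.join(rel, sub)
        )
        removed += sub_removed
        added += sub_added
        changed += sub_changed
    return sorted(removed), sorted(added), sorted(changed)


# Demo on two throwaway folders standing in for an older and a newer download.
with tempfile.TemporaryDirectory() as old, tempfile.TemporaryDirectory() as new:
    for root, pages in (
        (old, {"index.html": "v1", "gone.html": "x"}),
        (new, {"index.html": "v2 edited", "fresh.html": "y"}),
    ):
        os.makedirs(os.path.join(root, "blog"))
        for name, text in pages.items():
            with open(os.path.join(root, name), "w") as handle:
                handle.write(text)
        with open(os.path.join(root, "blog", "post.html"), "w") as handle:
            handle.write("same in both")
    removed, added, changed = diff_trees(old, new)

print("removed:", removed, "added:", added, "changed:", changed)
```

Rules for ignoring files you don't care about could be added by filtering the three lists, for example by file extension, before printing them.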
Of course, you could also do a diff on the website database .sql file directly if you can download it.