commit 06a29f4640eea35a28dc776c50b6e81ebcfab4ff
Author: Fabio
Date:   Wed Dec 10 16:53:24 2025 +0100

    first commit

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..28f003c
--- /dev/null
+++ b/README.md
@@ -0,0 +1,101 @@
+# 🌐 Website Downloader CLI
+[![CI – Website Downloader](https://github.com/PKHarsimran/website-downloader/actions/workflows/python-app.yml/badge.svg)](https://github.com/PKHarsimran/website-downloader/actions/workflows/python-app.yml)
+[![Lint & Style](https://github.com/PKHarsimran/website-downloader/actions/workflows/lint.yml/badge.svg)](https://github.com/PKHarsimran/website-downloader/actions/workflows/lint.yml)
+[![Automatic Dependency Submission](https://github.com/PKHarsimran/website-downloader/actions/workflows/dependency-graph/auto-submission/badge.svg)](https://github.com/PKHarsimran/website-downloader/actions/workflows/dependency-graph/auto-submission)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![Python](https://img.shields.io/badge/Python-3.10%2B-blue.svg)](https://www.python.org/)
+[![Code style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+
+Website Downloader CLI is a **tiny, pure-Python** site-mirroring tool that lets you grab a complete, browsable offline copy of any publicly reachable website:
+
+* Recursively crawls every same-origin link (including “pretty” `/about/` URLs)
+* Downloads **all** assets (images, CSS, JS, …)
+* Rewrites internal links so pages open flawlessly from your local disk
+* Streams files concurrently with automatic retry / back-off
+* Generates a clean, flat directory tree (`example_com/index.html`, `example_com/about/index.html`, …)
+* Handles extremely long filenames safely via hashing and graceful fallbacks
+
+> Perfect for web archiving, pentesting labs, long flights, or just poking around a site without an internet connection.
+
+---
+
+## 🚀 Quick Start
+
+```bash
+# 1. Grab the code
+git clone https://github.com/PKHarsimran/website-downloader.git
+cd website-downloader
+
+# 2. Install dependencies (only two runtime libs!)
+pip install -r requirements.txt
+
+# 3. Mirror a site – no prompts needed
+python website-downloader.py \
+    --url https://harsim.ca \
+    --destination harsim_ca_backup \
+    --max-pages 100 \
+    --threads 8
+```
+
+---
+
+## 🛠️ Libraries Used
+
+| Library | Emoji | Purpose in this project |
+|---------|-------|-------------------------|
+| **requests** + **urllib3.Retry** | 🌐 | High-level HTTP client with automatic retry / back-off for flaky hosts |
+| **BeautifulSoup (bs4)** | 🍜 | Parses downloaded HTML and extracts every ``, ``, `