first commit

Fabio 2025-12-10 16:53:24 +01:00
commit 06a29f4640
4 changed files with 518 additions and 0 deletions

101
README.md Normal file

@@ -0,0 +1,101 @@
# 🌐 Website Downloader CLI
[![CI Website Downloader](https://github.com/PKHarsimran/website-downloader/actions/workflows/python-app.yml/badge.svg)](https://github.com/PKHarsimran/website-downloader/actions/workflows/python-app.yml)
[![Lint & Style](https://github.com/PKHarsimran/website-downloader/actions/workflows/lint.yml/badge.svg)](https://github.com/PKHarsimran/website-downloader/actions/workflows/lint.yml)
[![Automatic Dependency Submission](https://github.com/PKHarsimran/website-downloader/actions/workflows/dependency-graph/auto-submission/badge.svg)](https://github.com/PKHarsimran/website-downloader/actions/workflows/dependency-graph/auto-submission)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python](https://img.shields.io/badge/Python-3.10%2B-blue.svg)](https://www.python.org/)
[![Code style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
Website Downloader CLI is a **tiny, pure-Python** site-mirroring tool that lets you grab a complete, browsable offline copy of any publicly reachable website:
* Recursively crawls every same-origin link (including “pretty” `/about/` URLs)
* Downloads **all** assets (images, CSS, JS, …)
* Rewrites internal links so pages open flawlessly from your local disk
* Streams files concurrently with automatic retry / back-off
* Generates a clean, flat directory tree (`example_com/index.html`, `example_com/about/index.html`, …)
* Handles extremely long filenames safely via hashing and graceful fallbacks
> Perfect for web archiving, pentesting labs, long flights, or just poking around a site without an internet connection.
---
## 🚀 Quick Start
```bash
# 1. Grab the code
git clone https://github.com/PKHarsimran/website-downloader.git
cd website-downloader
# 2. Install dependencies (only two runtime libs!)
pip install -r requirements.txt
# 3. Mirror a site (no prompts needed)
python website-downloader.py \
--url https://harsim.ca \
--destination harsim_ca_backup \
--max-pages 100 \
--threads 8
```
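If you omit the optional flags, the defaults from `parse_args()` apply: the output folder is derived from the host (`example.com` → `example_com`), `--max-pages` defaults to 50, and `--threads` to 6. A minimal run (the URL below is illustrative) is simply:
```bash
# Uses the defaults: ./example_com output folder, 50-page cap, 6 download threads
python website-downloader.py --url https://example.com
```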
---
## 🛠️ Libraries Used
| Library | Emoji | Purpose in this project |
|---------|-------|-------------------------|
| **requests** + **urllib3.Retry** | 🌐 | High-level HTTP client with automatic retry / back-off for flaky hosts |
| **BeautifulSoup (bs4)** | 🍜 | Parses downloaded HTML and extracts every `<a>`, `<img>`, `<script>`, and `<link>` |
| **argparse** | 🛠️ | Powers the modern CLI (`--url`, `--destination`, `--max-pages`, `--threads`, …) |
| **logging** | 📝 | Dual console / file logging with colour + crawl-time stats |
| **threading** & **queue** | ⚙️ | Lightweight thread-pool that streams images/CSS/JS concurrently |
| **pathlib** & **os** | 📂 | Cross-platform file-system helpers (`Path` magic, directory creation, etc.) |
| **time** | ⏱️ | Measures per-page latency and total crawl duration |
| **urllib.parse** | 🔗 | Safely joins / analyses URLs and rewrites them to local relative paths |
| **sys** | 🖥️ | Directs log output to `stdout` and handles graceful interrupts (`Ctrl-C`) |
## 🗂️ Project Structure
| Path | What it is | Key features |
|------|------------|--------------|
| `website-downloader.py` | **Single-entry CLI** that performs the entire crawl *and* link-rewriting pipeline. | • Persistent `requests.Session` with automatic retries<br>• Breadth-first crawl capped by `--max-pages` (default = 50)<br>• Thread-pool (configurable via `--threads`, default = 6) to fetch images/CSS/JS in parallel<br>• Robust link rewriting so every internal URL works offline (pretty-URL folders ➜ `index.html`, plain paths ➜ `.html`)<br>• Smart output folder naming (`example.com` → `example_com`)<br>• Colourised console + file logging with per-page latency and crawl summary |
| `requirements.txt` | Minimal dependency pin-list. Only **`requests`** and **`beautifulsoup4`** are third-party; everything else is Python ≥ 3.10 std-lib. |
| `web_scraper.log` | Auto-generated run log (appended to on each invocation). Useful for troubleshooting or audit trails. |
| `README.md` | The document you're reading: quick-start, flags, and architecture notes. |
| *(output folder)* | Created at runtime (`example_com/…`); mirrors the remote directory tree with `index.html` stubs and all static assets. |
> **Removed:** The old `check_download.py` verifier is no longer required because the new downloader performs integrity checks (missing files, broken internal links) during the crawl and reports any issues directly in the log summary.
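To make the link-rewriting rules concrete, here is an illustrative mapping derived from the `to_local_path()` logic in `website-downloader.py` (the short hash suffix is shown as a `<hash>` placeholder; actual values differ):

| Remote URL | Saved as |
|------------|----------|
| `https://example.com/` | `example_com/index.html` |
| `https://example.com/about/` | `example_com/about/index.html` |
| `https://example.com/pricing` | `example_com/pricing.html` |
| `https://example.com/css/site.css` | `example_com/css/site.css` |
| `https://example.com/search?q=x` | `example_com/search-q<hash>.html` |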
## ✨ Recent Improvements
- ✅ **Type Conversion Fix:** fixed a `TypeError` caused by `int(..., 10)` when non-string arguments were passed.
- ✅ **Safer Path Handling:** added intelligent path shortening and hashing for long filenames to prevent `OSError: [Errno 36] File name too long` errors.
- ✅ **Improved CLI Experience:** rebuilt argument parsing with `argparse` for cleaner syntax and validation.
- ✅ **Code Quality & Linting:** applied Black + Flake8 formatting; the project now passes all CI lint checks.
- ✅ **Logging & Stability:** improved error handling, logging, and fallback mechanisms for failed writes.
- ✅ **Skip Non-Fetchable Schemes:** the crawler now safely skips `mailto:`, `tel:`, `javascript:`, and `data:` links instead of trying to download them, preventing `requests.exceptions.InvalidSchema: No connection adapters were found` errors while keeping those links intact in the saved HTML (a minimal sketch of this check follows below).
## 🤝 Contributing
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
## 📜 License
This project is licensed under the MIT License.
## ❤️ Support This Project
[![Donate](https://img.shields.io/badge/Donate-PayPal-blue)](https://www.paypal.com/donate/?business=MVEWG3QAX6UBC&no_recurring=1&item_name=Github+Project+-+Website+downloader&currency_code=CAD)

7
downloadsite.sh Executable file

@@ -0,0 +1,7 @@
#!/bin/bash
source /usr/local/python/website-downloader/.venv/bin/activate
python /usr/local/python/website-downloader/website-downloader.py \
--url "$1" \
--destination "$2" \
--max-pages 100 \
--threads 8
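
A typical call to the wrapper, assuming the virtualenv at the hard-coded path above exists (the URL and output folder are illustrative):
```bash
# $1 = start URL, $2 = output folder passed through to website-downloader.py
./downloadsite.sh https://example.com example_backup
```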

4
requirements.txt Normal file

@@ -0,0 +1,4 @@
requests~=2.32.4
beautifulsoup4~=4.13.4
wget~=3.2
urllib3~=2.5.0

406
website-downloader.py Executable file

@@ -0,0 +1,406 @@
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import logging
import os
import queue
import sys
import threading
import time
from hashlib import sha256
from pathlib import Path
from typing import Optional
from urllib.parse import ParseResult, urljoin, urlparse
import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util import Retry
# ---------------------------------------------------------------------------
# Config / constants
# ---------------------------------------------------------------------------
LOG_FMT = "%(asctime)s | %(levelname)-8s | %(threadName)s | %(message)s"
DEFAULT_HEADERS = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) "
"Gecko/20100101 Firefox/128.0"
}
TIMEOUT = 15 # seconds
CHUNK_SIZE = 8192 # bytes
# Conservative margins under common OS limits (~255-260 bytes)
MAX_PATH_LEN = 240
MAX_SEG_LEN = 120
# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------
logging.basicConfig(
filename="web_scraper.log",
level=logging.DEBUG,
format=LOG_FMT,
datefmt="%H:%M:%S",
force=True,
)
_console = logging.StreamHandler(sys.stdout)
_console.setLevel(logging.INFO)
_console.setFormatter(logging.Formatter(LOG_FMT, datefmt="%H:%M:%S"))
logging.getLogger().addHandler(_console)
log = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# HTTP session (retry, timeouts, custom UA)
# ---------------------------------------------------------------------------
SESSION = requests.Session()
RETRY_STRAT = Retry(
total=5,
backoff_factor=0.5,
status_forcelist=[429, 500, 502, 503, 504],
allowed_methods=["GET", "HEAD"],
)
SESSION.mount("http://", HTTPAdapter(max_retries=RETRY_STRAT))
SESSION.mount("https://", HTTPAdapter(max_retries=RETRY_STRAT))
SESSION.headers.update(DEFAULT_HEADERS)
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def create_dir(path: Path) -> None:
"""Create path (and parents) if it does not already exist."""
if not path.exists():
path.mkdir(parents=True, exist_ok=True)
log.debug("Created directory %s", path)
def sanitize(url_fragment: str) -> str:
"""Strip back-references and Windows backslashes."""
return url_fragment.replace("\\", "/").replace("..", "").strip()
NON_FETCHABLE_SCHEMES = {"mailto", "tel", "sms", "javascript", "data", "geo", "blob"}
def is_httpish(u: str) -> bool:
"""True iff the URL is http(s) or relative (no scheme)."""
p = urlparse(u)
return (p.scheme in ("http", "https")) or (p.scheme == "")
def is_non_fetchable(u: str) -> bool:
"""True iff the URL clearly shouldn't be fetched (mailto:, tel:, data:, ...)."""
p = urlparse(u)
return p.scheme in NON_FETCHABLE_SCHEMES
def is_internal(link: str, root_netloc: str) -> bool:
"""Return True if link belongs to root_netloc (or is protocol-relative)."""
parsed = urlparse(link)
return not parsed.netloc or parsed.netloc == root_netloc
def _shorten_segment(segment: str, limit: int = MAX_SEG_LEN) -> str:
"""
Shorten a single path segment if over limit.
Preserve extension; append a short hash to keep it unique.
"""
if len(segment) <= limit:
return segment
p = Path(segment)
stem, suffix = p.stem, p.suffix
h = sha256(segment.encode("utf-8")).hexdigest()[:12]
# leave room for '-' + hash + suffix
keep = max(0, limit - len(suffix) - 13)
return f"{stem[:keep]}-{h}{suffix}"
def to_local_path(parsed: ParseResult, site_root: Path) -> Path:
"""
Map an internal URL to a local file path under site_root.
- Adds 'index.html' where appropriate.
- Converts extensionless paths to '.html'.
- Appends a short query-hash when ?query is present to avoid collisions.
- Enforces per-segment and overall path length limits. If still too long,
hashes the leaf name.
"""
rel = parsed.path.lstrip("/")
if not rel:
rel = "index.html"
elif rel.endswith("/"):
rel += "index.html"
elif not Path(rel).suffix:
rel += ".html"
if parsed.query:
qh = sha256(parsed.query.encode("utf-8")).hexdigest()[:10]
p = Path(rel)
rel = str(p.with_name(f"{p.stem}-q{qh}{p.suffix}"))
# Shorten individual segments
parts = Path(rel).parts
parts = tuple(_shorten_segment(seg, MAX_SEG_LEN) for seg in parts)
local_path = site_root / Path(*parts)
# If full path is still too long, hash the leaf
if len(str(local_path)) > MAX_PATH_LEN:
p = local_path
h = sha256(parsed.geturl().encode("utf-8")).hexdigest()[:16]
leaf = _shorten_segment(f"{p.stem}-{h}{p.suffix}", MAX_SEG_LEN)
local_path = p.with_name(leaf)
return local_path
def safe_write_text(path: Path, text: str, encoding: str = "utf-8") -> Path:
"""
Write text to path, falling back to a hashed filename if OS rejects it
(e.g., filename too long). Returns the final path used.
"""
try:
path.write_text(text, encoding=encoding)
return path
except OSError as exc:
log.warning("Write failed for %s: %s. Falling back to hashed leaf.", path, exc)
p = path
h = sha256(str(p).encode("utf-8")).hexdigest()[:16]
fallback = p.with_name(_shorten_segment(f"{p.stem}-{h}{p.suffix}", MAX_SEG_LEN))
create_dir(fallback.parent)
fallback.write_text(text, encoding=encoding)
return fallback
# ---------------------------------------------------------------------------
# Fetchers
# ---------------------------------------------------------------------------
def fetch_html(url: str) -> Optional[BeautifulSoup]:
"""Download url and return a BeautifulSoup tree (or None on error)."""
try:
resp = SESSION.get(url, timeout=TIMEOUT)
resp.raise_for_status()
return BeautifulSoup(resp.text, "html.parser")
except Exception as exc: # noqa: BLE001
log.warning("HTTP error for %s %s", url, exc)
return None
def fetch_binary(url: str, dest: Path) -> None:
"""Stream url to dest unless it already exists. Safe against long paths."""
if dest.exists():
return
try:
resp = SESSION.get(url, timeout=TIMEOUT, stream=True)
resp.raise_for_status()
create_dir(dest.parent)
try:
with dest.open("wb") as fh:
for chunk in resp.iter_content(CHUNK_SIZE):
fh.write(chunk)
log.debug("Saved resource -> %s", dest)
except OSError as exc:
# Fallback to hashed leaf if OS rejects path
log.warning("Binary write failed for %s: %s. Using fallback.", dest, exc)
p = dest
h = sha256(str(p).encode("utf-8")).hexdigest()[:16]
fallback = p.with_name(
_shorten_segment(f"{p.stem}-{h}{p.suffix}", MAX_SEG_LEN)
)
create_dir(fallback.parent)
with fallback.open("wb") as fh:
for chunk in resp.iter_content(CHUNK_SIZE):
fh.write(chunk)
log.debug("Saved resource (fallback) -> %s", fallback)
except Exception as exc: # noqa: BLE001
log.error("Failed to save %s %s", url, exc)
# ---------------------------------------------------------------------------
# Link rewriting
# ---------------------------------------------------------------------------
def rewrite_links(
soup: BeautifulSoup, page_url: str, site_root: Path, page_dir: Path
) -> None:
"""Rewrite internal links to local relative paths under site_root."""
root_netloc = urlparse(page_url).netloc
for tag in soup.find_all(["a", "img", "script", "link"]):
attr = "href" if tag.name in {"a", "link"} else "src"
if not tag.has_attr(attr):
continue
original = sanitize(tag[attr])
if (
original.startswith("#")
or is_non_fetchable(original)
or not is_httpish(original)
):
continue
abs_url = urljoin(page_url, original)
if not is_internal(abs_url, root_netloc):
            continue  # external; leave untouched
local_path = to_local_path(urlparse(abs_url), site_root)
try:
tag[attr] = os.path.relpath(local_path, page_dir)
except ValueError:
# Different drives on Windows, etc.
tag[attr] = str(local_path)
# ---------------------------------------------------------------------------
# Crawl coordinator
# ---------------------------------------------------------------------------
def crawl_site(start_url: str, root: Path, max_pages: int, threads: int) -> None:
"""Breadth-first crawl limited to max_pages. Downloads assets via workers."""
q_pages: queue.Queue[str] = queue.Queue()
q_pages.put(start_url)
seen_pages: set[str] = set()
download_q: queue.Queue[tuple[str, Path]] = queue.Queue()
def worker() -> None:
while True:
try:
url, dest = download_q.get(timeout=3)
except queue.Empty:
return
if is_non_fetchable(url) or not is_httpish(url):
log.debug("Skip non-fetchable: %s", url)
download_q.task_done()
continue
fetch_binary(url, dest)
download_q.task_done()
workers: list[threading.Thread] = []
for i in range(max(1, threads)):
t = threading.Thread(target=worker, name=f"DL-{i+1}", daemon=True)
t.start()
workers.append(t)
start_time = time.time()
root_netloc = urlparse(start_url).netloc
while not q_pages.empty() and len(seen_pages) < max_pages:
page_url = q_pages.get()
if page_url in seen_pages:
continue
seen_pages.add(page_url)
log.info("[%s/%s] %s", len(seen_pages), max_pages, page_url)
soup = fetch_html(page_url)
if soup is None:
continue
# Gather links & assets
for tag in soup.find_all(["img", "script", "link", "a"]):
link = tag.get("src") or tag.get("href")
if not link:
continue
link = sanitize(link)
if link.startswith("#") or is_non_fetchable(link) or not is_httpish(link):
continue
abs_url = urljoin(page_url, link)
parsed = urlparse(abs_url)
if not is_internal(abs_url, root_netloc):
continue
dest_path = to_local_path(parsed, root)
            # Looks like a page (trailing slash or no file extension) -> queue for crawling
if parsed.path.endswith("/") or not Path(parsed.path).suffix:
if abs_url not in seen_pages and abs_url not in list(
q_pages.queue
): # type: ignore[arg-type]
q_pages.put(abs_url)
else:
download_q.put((abs_url, dest_path))
# Save current page
local_path = to_local_path(urlparse(page_url), root)
create_dir(local_path.parent)
rewrite_links(soup, page_url, root, local_path.parent)
html = soup.prettify()
final_path = safe_write_text(local_path, html, encoding="utf-8")
log.debug("Saved page %s", final_path)
download_q.join()
elapsed = time.time() - start_time
if seen_pages:
log.info(
"Crawl finished: %s pages in %.2fs (%.2fs avg)",
len(seen_pages),
elapsed,
elapsed / len(seen_pages),
)
else:
log.warning("Nothing downloaded check URL or connectivity")
# ---------------------------------------------------------------------------
# Helper function for output folder
# ---------------------------------------------------------------------------
def make_root(url: str, custom: Optional[str]) -> Path:
"""Derive output folder from URL if custom not supplied."""
return Path(custom) if custom else Path(urlparse(url).netloc.replace(".", "_"))
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def parse_args() -> argparse.Namespace:
p = argparse.ArgumentParser(
description="Recursively mirror a website for offline use.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
p.add_argument(
"--url",
required=True,
help="Starting URL to crawl (e.g., https://example.com/).",
)
p.add_argument(
"--destination",
default=None,
help="Output folder (defaults to a folder derived from the URL).",
)
p.add_argument(
"--max-pages",
type=int,
default=50,
help="Maximum number of HTML pages to crawl.",
)
p.add_argument(
"--threads",
type=int,
default=6,
help="Number of concurrent download workers.",
)
return p.parse_args()
if __name__ == "__main__":
args = parse_args()
if args.max_pages < 1:
log.error("--max-pages must be >= 1")
sys.exit(2)
if args.threads < 1:
log.error("--threads must be >= 1")
sys.exit(2)
host = args.url
root = make_root(args.url, args.destination)
crawl_site(host, root, args.max_pages, args.threads)