first commit

Fabio 2025-12-10 16:53:24 +01:00
commit 06a29f4640
4 changed files with 518 additions and 0 deletions

101
README.md Normal file

@@ -0,0 +1,101 @@
# 🌐 Website Downloader CLI
[![CI Website Downloader](https://github.com/PKHarsimran/website-downloader/actions/workflows/python-app.yml/badge.svg)](https://github.com/PKHarsimran/website-downloader/actions/workflows/python-app.yml)
[![Lint & Style](https://github.com/PKHarsimran/website-downloader/actions/workflows/lint.yml/badge.svg)](https://github.com/PKHarsimran/website-downloader/actions/workflows/lint.yml)
[![Automatic Dependency Submission](https://github.com/PKHarsimran/website-downloader/actions/workflows/dependency-graph/auto-submission/badge.svg)](https://github.com/PKHarsimran/website-downloader/actions/workflows/dependency-graph/auto-submission)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python](https://img.shields.io/badge/Python-3.10%2B-blue.svg)](https://www.python.org/)
[![Code style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
Website Downloader CLI is a **tiny, pure-Python** site-mirroring tool that lets you grab a complete, browsable offline copy of any publicly reachable website:
* Recursively crawls every same-origin link (including “pretty” `/about/` URLs)
* Downloads **all** assets (images, CSS, JS, …)
* Rewrites internal links so pages open flawlessly from your local disk
* Streams files concurrently with automatic retry / back-off
* Generates a clean, flat directory tree (`example_com/index.html`, `example_com/about/index.html`, …)
* Handles extremely long filenames safely via hashing and graceful fallbacks
> Perfect for web archiving, pentesting labs, long flights, or just poking around a site without an internet connection.
---
## 🚀 Quick Start
```bash
# 1. Grab the code
git clone https://github.com/PKHarsimran/website-downloader.git
cd website-downloader
# 2. Install dependencies (only two runtime libs!)
pip install -r requirements.txt
# 3. Mirror a site (no prompts needed)
python website-downloader.py \
--url https://harsim.ca \
--destination harsim_ca_backup \
--max-pages 100 \
--threads 8
```
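If you omit the optional flags, the defaults from `parse_args()` apply: the output folder is derived from the host (`example.com` → `example_com`), `--max-pages` defaults to 50, and `--threads` to 6. A minimal run (the URL below is illustrative) is simply:
```bash
# Uses the defaults: ./example_com output folder, 50-page cap, 6 download threads
python website-downloader.py --url https://example.com
```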
---
## 🛠️ Libraries Used
| Library | Emoji | Purpose in this project |
|---------|-------|-------------------------|
| **requests** + **urllib3.Retry** | 🌐 | High-level HTTP client with automatic retry / back-off for flaky hosts |
| **BeautifulSoup (bs4)** | 🍜 | Parses downloaded HTML and extracts every `<a>`, `<img>`, `<script>`, and `<link>` |
| **argparse** | 🛠️ | Powers the modern CLI (`--url`, `--destination`, `--max-pages`, `--threads`, …) |
| **logging** | 📝 | Dual console / file logging with colour + crawl-time stats |
| **threading** & **queue** | ⚙️ | Lightweight thread-pool that streams images/CSS/JS concurrently |
| **pathlib** & **os** | 📂 | Cross-platform file-system helpers (`Path` magic, directory creation, etc.) |
| **time** | ⏱️ | Measures per-page latency and total crawl duration |
| **urllib.parse** | 🔗 | Safely joins / analyses URLs and rewrites them to local relative paths |
| **sys** | 🖥️ | Directs log output to `stdout` and handles graceful interrupts (`Ctrl-C`) |
## 🗂️ Project Structure
| Path | What it is | Key features |
|------|------------|--------------|
| `website-downloader.py` | **Single-entry CLI** that performs the entire crawl *and* link-rewriting pipeline. | • Persistent `requests.Session` with automatic retries<br>• Breadth-first crawl capped by `--max-pages` (default = 50)<br>• Thread-pool (configurable via `--threads`, default = 6) to fetch images/CSS/JS in parallel<br>• Robust link rewriting so every internal URL works offline (pretty-URL folders ➜ `index.html`, plain paths ➜ `.html`)<br>• Smart output folder naming (`example.com` → `example_com`)<br>• Colourised console + file logging with per-page latency and crawl summary |
| `requirements.txt` | Minimal dependency pin-list. Only **`requests`** and **`beautifulsoup4`** are third-party; everything else is Python ≥ 3.10 std-lib. |
| `web_scraper.log` | Auto-generated run log (appended to on each invocation). Useful for troubleshooting or audit trails. |
| `README.md` | The document you're reading: quick-start, flags, and architecture notes. |
| *(output folder)* | Created at runtime (`example_com/…`); mirrors the remote directory tree with `index.html` stubs and all static assets. |
> **Removed:** The old `check_download.py` verifier is no longer required because the new downloader performs integrity checks (missing files, broken internal links) during the crawl and reports any issues directly in the log summary.
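To make the link-rewriting rules concrete, here is an illustrative mapping derived from the `to_local_path()` logic in `website-downloader.py` (the short hash suffix is shown as a `<hash>` placeholder; actual values differ):

| Remote URL | Saved as |
|------------|----------|
| `https://example.com/` | `example_com/index.html` |
| `https://example.com/about/` | `example_com/about/index.html` |
| `https://example.com/pricing` | `example_com/pricing.html` |
| `https://example.com/css/site.css` | `example_com/css/site.css` |
| `https://example.com/search?q=x` | `example_com/search-q<hash>.html` |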
## ✨ Recent Improvements
- ✅ **Type Conversion Fix:** fixed a `TypeError` caused by `int(..., 10)` when non-string arguments were passed.
- ✅ **Safer Path Handling:** added intelligent path shortening and hashing for long filenames to prevent `OSError: [Errno 36] File name too long` errors.
- ✅ **Improved CLI Experience:** rebuilt argument parsing with `argparse` for cleaner syntax and validation.
- ✅ **Code Quality & Linting:** applied Black + Flake8 formatting; the project now passes all CI lint checks.
- ✅ **Logging & Stability:** improved error handling, logging, and fallback mechanisms for failed writes.
- ✅ **Skip Non-Fetchable Schemes:** the crawler now safely skips `mailto:`, `tel:`, `javascript:`, and `data:` links instead of trying to download them, preventing `requests.exceptions.InvalidSchema: No connection adapters were found` errors while keeping those links intact in the saved HTML (a minimal sketch of this check follows below).
## 🤝 Contributing
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
## 📜 License
This project is licensed under the MIT License.
## ❤️ Support This Project
[![Donate](https://img.shields.io/badge/Donate-PayPal-blue)](https://www.paypal.com/donate/?business=MVEWG3QAX6UBC&no_recurring=1&item_name=Github+Project+-+Website+downloader&currency_code=CAD)

7
downloadsite.sh Executable file

@@ -0,0 +1,7 @@
#!/bin/bash
source /usr/local/python/website-downloader/.venv/bin/activate
python /usr/local/python/website-downloader/website-downloader.py \
--url "$1" \
--destination "$2" \
--max-pages 100 \
--threads 8
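
A typical call to the wrapper, assuming the virtualenv at the hard-coded path above exists (the URL and output folder are illustrative):
```bash
# $1 = start URL, $2 = output folder passed through to website-downloader.py
./downloadsite.sh https://example.com example_backup
```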

4
requirements.txt Normal file

@@ -0,0 +1,4 @@
requests~=2.32.4
beautifulsoup4~=4.13.4
wget~=3.2
urllib3~=2.5.0

406
website-downloader.py Executable file

@@ -0,0 +1,406 @@
#!/usr/bin/env python3
from __future__ import annotations
import argparse
import logging
import os
import queue
import sys
import threading
import time
from hashlib import sha256
from pathlib import Path
from typing import Optional
from urllib.parse import ParseResult, urljoin, urlparse
import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util import Retry
# ---------------------------------------------------------------------------
# Config / constants
# ---------------------------------------------------------------------------
LOG_FMT = "%(asctime)s | %(levelname)-8s | %(threadName)s | %(message)s"
DEFAULT_HEADERS = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) "
"Gecko/20100101 Firefox/128.0"
}
TIMEOUT = 15 # seconds
CHUNK_SIZE = 8192 # bytes
# Conservative margins under common OS limits (~255-260 bytes)
MAX_PATH_LEN = 240
MAX_SEG_LEN = 120
# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------
logging.basicConfig(
filename="web_scraper.log",
level=logging.DEBUG,
format=LOG_FMT,
datefmt="%H:%M:%S",
force=True,
)
_console = logging.StreamHandler(sys.stdout)
_console.setLevel(logging.INFO)
_console.setFormatter(logging.Formatter(LOG_FMT, datefmt="%H:%M:%S"))
logging.getLogger().addHandler(_console)
log = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# HTTP session (retry, timeouts, custom UA)
# ---------------------------------------------------------------------------
SESSION = requests.Session()
RETRY_STRAT = Retry(
total=5,
backoff_factor=0.5,
status_forcelist=[429, 500, 502, 503, 504],
allowed_methods=["GET", "HEAD"],
)
SESSION.mount("http://", HTTPAdapter(max_retries=RETRY_STRAT))
SESSION.mount("https://", HTTPAdapter(max_retries=RETRY_STRAT))
SESSION.headers.update(DEFAULT_HEADERS)
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def create_dir(path: Path) -> None:
"""Create path (and parents) if it does not already exist."""
if not path.exists():
path.mkdir(parents=True, exist_ok=True)
log.debug("Created directory %s", path)
def sanitize(url_fragment: str) -> str:
"""Strip back-references and Windows backslashes."""
return url_fragment.replace("\\", "/").replace("..", "").strip()
NON_FETCHABLE_SCHEMES = {"mailto", "tel", "sms", "javascript", "data", "geo", "blob"}
def is_httpish(u: str) -> bool:
"""True iff the URL is http(s) or relative (no scheme)."""
p = urlparse(u)
return (p.scheme in ("http", "https")) or (p.scheme == "")
def is_non_fetchable(u: str) -> bool:
"""True iff the URL clearly shouldn't be fetched (mailto:, tel:, data:, ...)."""
p = urlparse(u)
return p.scheme in NON_FETCHABLE_SCHEMES
def is_internal(link: str, root_netloc: str) -> bool:
"""Return True if link belongs to root_netloc (or is protocol-relative)."""
parsed = urlparse(link)
return not parsed.netloc or parsed.netloc == root_netloc
def _shorten_segment(segment: str, limit: int = MAX_SEG_LEN) -> str:
"""
Shorten a single path segment if over limit.
Preserve extension; append a short hash to keep it unique.
"""
if len(segment) <= limit:
return segment
p = Path(segment)
stem, suffix = p.stem, p.suffix
h = sha256(segment.encode("utf-8")).hexdigest()[:12]
# leave room for '-' + hash + suffix
keep = max(0, limit - len(suffix) - 13)
return f"{stem[:keep]}-{h}{suffix}"
def to_local_path(parsed: ParseResult, site_root: Path) -> Path:
"""
Map an internal URL to a local file path under site_root.
- Adds 'index.html' where appropriate.
- Converts extensionless paths to '.html'.
- Appends a short query-hash when ?query is present to avoid collisions.
- Enforces per-segment and overall path length limits. If still too long,
hashes the leaf name.
"""
rel = parsed.path.lstrip("/")
if not rel:
rel = "index.html"
elif rel.endswith("/"):
rel += "index.html"
elif not Path(rel).suffix:
rel += ".html"
if parsed.query:
qh = sha256(parsed.query.encode("utf-8")).hexdigest()[:10]
p = Path(rel)
rel = str(p.with_name(f"{p.stem}-q{qh}{p.suffix}"))
# Shorten individual segments
parts = Path(rel).parts
parts = tuple(_shorten_segment(seg, MAX_SEG_LEN) for seg in parts)
local_path = site_root / Path(*parts)
# If full path is still too long, hash the leaf
if len(str(local_path)) > MAX_PATH_LEN:
p = local_path
h = sha256(parsed.geturl().encode("utf-8")).hexdigest()[:16]
leaf = _shorten_segment(f"{p.stem}-{h}{p.suffix}", MAX_SEG_LEN)
local_path = p.with_name(leaf)
return local_path
def safe_write_text(path: Path, text: str, encoding: str = "utf-8") -> Path:
"""
Write text to path, falling back to a hashed filename if OS rejects it
(e.g., filename too long). Returns the final path used.
"""
try:
path.write_text(text, encoding=encoding)
return path
except OSError as exc:
log.warning("Write failed for %s: %s. Falling back to hashed leaf.", path, exc)
p = path
h = sha256(str(p).encode("utf-8")).hexdigest()[:16]
fallback = p.with_name(_shorten_segment(f"{p.stem}-{h}{p.suffix}", MAX_SEG_LEN))
create_dir(fallback.parent)
fallback.write_text(text, encoding=encoding)
return fallback
# ---------------------------------------------------------------------------
# Fetchers
# ---------------------------------------------------------------------------
def fetch_html(url: str) -> Optional[BeautifulSoup]:
"""Download url and return a BeautifulSoup tree (or None on error)."""
try:
resp = SESSION.get(url, timeout=TIMEOUT)
resp.raise_for_status()
return BeautifulSoup(resp.text, "html.parser")
except Exception as exc: # noqa: BLE001
log.warning("HTTP error for %s %s", url, exc)
return None
def fetch_binary(url: str, dest: Path) -> None:
"""Stream url to dest unless it already exists. Safe against long paths."""
if dest.exists():
return
try:
resp = SESSION.get(url, timeout=TIMEOUT, stream=True)
resp.raise_for_status()
create_dir(dest.parent)
try:
with dest.open("wb") as fh:
for chunk in resp.iter_content(CHUNK_SIZE):
fh.write(chunk)
log.debug("Saved resource -> %s", dest)
except OSError as exc:
# Fallback to hashed leaf if OS rejects path
log.warning("Binary write failed for %s: %s. Using fallback.", dest, exc)
p = dest
h = sha256(str(p).encode("utf-8")).hexdigest()[:16]
fallback = p.with_name(
_shorten_segment(f"{p.stem}-{h}{p.suffix}", MAX_SEG_LEN)
)
create_dir(fallback.parent)
with fallback.open("wb") as fh:
for chunk in resp.iter_content(CHUNK_SIZE):
fh.write(chunk)
log.debug("Saved resource (fallback) -> %s", fallback)
except Exception as exc: # noqa: BLE001
log.error("Failed to save %s %s", url, exc)
# ---------------------------------------------------------------------------
# Link rewriting
# ---------------------------------------------------------------------------
def rewrite_links(
soup: BeautifulSoup, page_url: str, site_root: Path, page_dir: Path
) -> None:
"""Rewrite internal links to local relative paths under site_root."""
root_netloc = urlparse(page_url).netloc
for tag in soup.find_all(["a", "img", "script", "link"]):
attr = "href" if tag.name in {"a", "link"} else "src"
if not tag.has_attr(attr):
continue
original = sanitize(tag[attr])
if (
original.startswith("#")
or is_non_fetchable(original)
or not is_httpish(original)
):
continue
abs_url = urljoin(page_url, original)
if not is_internal(abs_url, root_netloc):
            continue  # external; leave untouched
local_path = to_local_path(urlparse(abs_url), site_root)
try:
tag[attr] = os.path.relpath(local_path, page_dir)
except ValueError:
# Different drives on Windows, etc.
tag[attr] = str(local_path)
# ---------------------------------------------------------------------------
# Crawl coordinator
# ---------------------------------------------------------------------------
def crawl_site(start_url: str, root: Path, max_pages: int, threads: int) -> None:
"""Breadth-first crawl limited to max_pages. Downloads assets via workers."""
q_pages: queue.Queue[str] = queue.Queue()
q_pages.put(start_url)
seen_pages: set[str] = set()
download_q: queue.Queue[tuple[str, Path]] = queue.Queue()
def worker() -> None:
while True:
try:
url, dest = download_q.get(timeout=3)
except queue.Empty:
return
if is_non_fetchable(url) or not is_httpish(url):
log.debug("Skip non-fetchable: %s", url)
download_q.task_done()
continue
fetch_binary(url, dest)
download_q.task_done()
workers: list[threading.Thread] = []
for i in range(max(1, threads)):
t = threading.Thread(target=worker, name=f"DL-{i+1}", daemon=True)
t.start()
workers.append(t)
start_time = time.time()
root_netloc = urlparse(start_url).netloc
while not q_pages.empty() and len(seen_pages) < max_pages:
page_url = q_pages.get()
if page_url in seen_pages:
continue
seen_pages.add(page_url)
log.info("[%s/%s] %s", len(seen_pages), max_pages, page_url)
soup = fetch_html(page_url)
if soup is None:
continue
# Gather links & assets
for tag in soup.find_all(["img", "script", "link", "a"]):
link = tag.get("src") or tag.get("href")
if not link:
continue
link = sanitize(link)
if link.startswith("#") or is_non_fetchable(link) or not is_httpish(link):
continue
abs_url = urljoin(page_url, link)
parsed = urlparse(abs_url)
if not is_internal(abs_url, root_netloc):
continue
dest_path = to_local_path(parsed, root)
            # Looks like a page (trailing slash or no file extension) -> queue for crawling
if parsed.path.endswith("/") or not Path(parsed.path).suffix:
if abs_url not in seen_pages and abs_url not in list(
q_pages.queue
): # type: ignore[arg-type]
q_pages.put(abs_url)
else:
download_q.put((abs_url, dest_path))
# Save current page
local_path = to_local_path(urlparse(page_url), root)
create_dir(local_path.parent)
rewrite_links(soup, page_url, root, local_path.parent)
html = soup.prettify()
final_path = safe_write_text(local_path, html, encoding="utf-8")
log.debug("Saved page %s", final_path)
download_q.join()
elapsed = time.time() - start_time
if seen_pages:
log.info(
"Crawl finished: %s pages in %.2fs (%.2fs avg)",
len(seen_pages),
elapsed,
elapsed / len(seen_pages),
)
else:
log.warning("Nothing downloaded check URL or connectivity")
# ---------------------------------------------------------------------------
# Helper function for output folder
# ---------------------------------------------------------------------------
def make_root(url: str, custom: Optional[str]) -> Path:
"""Derive output folder from URL if custom not supplied."""
return Path(custom) if custom else Path(urlparse(url).netloc.replace(".", "_"))
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def parse_args() -> argparse.Namespace:
p = argparse.ArgumentParser(
description="Recursively mirror a website for offline use.",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
p.add_argument(
"--url",
required=True,
help="Starting URL to crawl (e.g., https://example.com/).",
)
p.add_argument(
"--destination",
default=None,
help="Output folder (defaults to a folder derived from the URL).",
)
p.add_argument(
"--max-pages",
type=int,
default=50,
help="Maximum number of HTML pages to crawl.",
)
p.add_argument(
"--threads",
type=int,
default=6,
help="Number of concurrent download workers.",
)
return p.parse_args()
if __name__ == "__main__":
args = parse_args()
if args.max_pages < 1:
log.error("--max-pages must be >= 1")
sys.exit(2)
if args.threads < 1:
log.error("--threads must be >= 1")
sys.exit(2)
host = args.url
root = make_root(args.url, args.destination)
crawl_site(host, root, args.max_pages, args.threads)