
πŸ•·οΈ Spider

A modern, scalable, and extensible web crawler for efficient distributed crawling and data extraction

Built with asynchronous I/O, a plugin architecture, and distributed task processing


✨ Features

Core Capabilities

  • πŸš€ Asynchronous Crawling - Non-blocking I/O with aiohttp and asyncio for high performance
  • 🌐 Distributed Processing - Scale across multiple workers using Celery and Redis
  • πŸ’Ύ Database Persistence - PostgreSQL storage with SQLAlchemy ORM
  • πŸ”Œ Plugin Architecture - Extensible system for custom data processing
  • πŸ“Š Robust Logging - Console, file, and database logging for diagnostics
  • πŸ”— URL Normalization - Smart deduplication and link management (see the sketch below)
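
The README does not spell out the normalization rules, but canonicalizing URLs before they enter the frontier is what makes deduplication work. The helper below is an illustrative sketch only (the real logic lives in src/spider/utils.py and may differ):

from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    # Illustrative sketch -- not the actual utils.py implementation.
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Drop default ports and fragments; collapse an empty path to "/".
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[: -len(":80")]
    elif scheme == "https" and netloc.endswith(":443"):
        netloc = netloc[: -len(":443")]
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))

# "HTTP://Example.com:80/page#top" and "http://example.com/page" map to the same key.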

Included Plugins

  • πŸ•ΈοΈ Web Scraper - Comprehensive webpage data extraction
  • πŸ“ Title Logger - Extract and store page titles
  • πŸ€– Entity Extraction - NLP-based named entity recognition (spaCy)
  • 🎭 Dynamic Scraper - JavaScript-rendered pages (Playwright)
  • πŸ“ˆ Real-time Metrics - Live crawl statistics via WebSocket

πŸš€ Quick Start

Prerequisites

  • Python 3.11+
  • PostgreSQL
  • Redis

Installation

# Clone repository
git clone https://github.com/roshanlam/spider.git
cd spider

# Install with Poetry (recommended)
poetry install

# Install Playwright browsers
poetry run playwright install chromium

# Download spaCy model
poetry run python -m spacy download en_core_web_sm

Configuration

Edit src/spider/config.yaml:

start_url: "http://example.com"
rate_limit: 1  # seconds between requests
threads: 8
timeout: 10

database:
  url: "postgresql://username@localhost/crawlerdb"

celery:
  broker_url: "redis://localhost:6379/0"
  result_backend: "redis://localhost:6379/0"
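
These settings surface in Python as a plain dictionary exposed by spider.config (the examples below read config['start_url']). A minimal sketch of what the loader in src/spider/config.py roughly does, assuming PyYAML:

import yaml

# Parse the YAML file into a nested dict (sketch; the real loader may add defaults or validation).
with open("src/spider/config.yaml") as f:
    config = yaml.safe_load(f)

print(config["start_url"])             # "http://example.com"
print(config["rate_limit"])            # 1
print(config["database"]["url"])       # "postgresql://username@localhost/crawlerdb"
print(config["celery"]["broker_url"])  # "redis://localhost:6379/0"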

Run the Crawler

Simple way:

poetry run python run.py

Or using module:

poetry run python -m spider.main

Query Scraped Data

# View all scraped data
poetry run python query_data.py

# Or programmatically
poetry run python
>>> from spider.plugins.scraper_utils import ScraperDataQuery
>>> query = ScraperDataQuery()
>>> page = query.get_page_data("http://example.com")
>>> print(page['title'])

πŸ“¦ Project Structure

spider/
β”œβ”€β”€ src/spider/
β”‚   β”œβ”€β”€ spider.py           # Core async crawler
β”‚   β”œβ”€β”€ plugin.py           # Plugin system
β”‚   β”œβ”€β”€ storage.py          # Database persistence
β”‚   β”œβ”€β”€ link_finder.py      # HTML parsing and link extraction
β”‚   β”œβ”€β”€ tasks.py            # Celery distributed tasks
β”‚   β”œβ”€β”€ config.py           # Configuration loader
β”‚   β”œβ”€β”€ utils.py            # URL normalization and utilities
β”‚   └── plugins/
β”‚       β”œβ”€β”€ web_scraper_plugin.py      # Comprehensive web scraper
β”‚       β”œβ”€β”€ scraper_utils.py           # Query utilities
β”‚       β”œβ”€β”€ title_logger_plugin.py     # Title extraction
β”‚       β”œβ”€β”€ entity_extraction.py       # NLP entity extraction
β”‚       β”œβ”€β”€ dynamic_scraper.py         # JavaScript rendering
β”‚       └── real_time_metrics.py       # Live metrics
β”œβ”€β”€ docs/                   # Documentation
β”œβ”€β”€ examples/               # Usage examples
β”œβ”€β”€ tests/                  # Test suite
β”œβ”€β”€ run.py                  # Simple runner script
β”œβ”€β”€ query_data.py           # Data query script
└── pyproject.toml          # Poetry dependencies

πŸ”Œ Plugin System

Spider uses a powerful plugin architecture for extensibility.

Using the Web Scraper Plugin

The comprehensive web scraper extracts structured data from every page:

from spider.plugins.scraper_utils import ScraperDataQuery

query = ScraperDataQuery()

# Get page data
page = query.get_page_data("http://example.com")
print(f"Title: {page['title']}")
print(f"Words: {page['word_count']}")
print(f"Links: {len(page['links'])}")

# Search pages
results = query.search_by_title("python")

# Get statistics
stats = query.get_page_statistics()
print(f"Total pages: {stats['total_pages']}")

What gets extracted:

  • Metadata (title, description, keywords, author, language)
  • Content structure (headings, word count, text analysis)
  • Links (internal/external with anchor text)
  • Images (URLs, alt text, dimensions)
  • Forms (actions, methods, input fields)
  • Social metadata (OpenGraph, Twitter Card)
  • Structured data (JSON-LD)
  • Page structure (semantic HTML)
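
Taken together, a record returned by get_page_data() looks roughly like the dictionary below. Only the keys exercised in the examples on this page (title, description, word_count, links, headings) are confirmed; the remaining field names and shapes are assumptions for illustration:

page = {
    "url": "http://example.com",
    "title": "Example Domain",
    "description": "This domain is for use in illustrative examples.",
    "word_count": 128,
    "links": [  # the shape of each link entry is assumed
        {"href": "https://www.iana.org/domains/example", "text": "More information...", "internal": False},
    ],
    "headings": '{"h1": ["Example Domain"], "h2": [], "h3": []}',  # stored as a JSON string
    # ...plus images, forms, OpenGraph/Twitter Card metadata, and JSON-LD, per the list above
}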

πŸ“š Full documentation: docs/web-scraper/

Creating Custom Plugins

from spider.plugin import Plugin

class MyPlugin(Plugin):
    async def should_run(self, url: str, content: str) -> bool:
        return True  # Run on all pages

    async def process(self, url: str, content: str) -> str:
        # Your processing logic here
        print(f"Processing {url}")
        return content

# Register in main.py
plugin_manager.register(MyPlugin())
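
As a slightly more concrete illustration built on the same interface (this plugin is hypothetical, not shipped with Spider), here is one that flags pages with no <title> tag:

from spider.plugin import Plugin

class MissingTitlePlugin(Plugin):
    """Hypothetical example: warn when a crawled page lacks a <title> tag."""

    async def should_run(self, url: str, content: str) -> bool:
        # Only inspect responses that look like HTML.
        return "<html" in content.lower()

    async def process(self, url: str, content: str) -> str:
        if "<title" not in content.lower():
            print(f"⚠️ No <title> tag on {url}")
        return content  # pass the content through unchanged for the next plugin

plugin_manager.register(MissingTitlePlugin())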

πŸ“š Plugin documentation: Plugin.md


🌐 Distributed Mode

Run Spider across multiple workers for large-scale crawling.

Start Celery Worker

celery -A spider.tasks.celery_app worker --loglevel=info
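
To scale out, start the same worker command on additional machines pointed at the same Redis broker; Celery's standard --concurrency option caps the number of worker processes per node:

# Example: another worker with four processes (hostnames and concurrency are up to you)
celery -A spider.tasks.celery_app worker --loglevel=info --concurrency=4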

Queue Tasks

from spider.tasks import crawl_task

result = crawl_task.delay("https://example.com")
print(f"Task ID: {result.id}")

πŸ“Š Usage Examples

Basic Crawling

import asyncio
from spider.spider import Spider
from spider.config import config
from spider.plugin import PluginManager
from spider.plugins.web_scraper_plugin import WebScraperPlugin

# Setup
plugin_manager = PluginManager()
plugin_manager.register(WebScraperPlugin())

# Create and run crawler
crawler = Spider(config['start_url'], config, plugin_manager)
asyncio.run(crawler.crawl())

Query Data

from spider.plugins.scraper_utils import ScraperDataQuery

query = ScraperDataQuery()

# Get all pages
pages = query.get_all_pages(limit=10)
for page in pages:
    print(f"{page['url']}: {page['title']}")

# Find pages with forms
pages_with_forms = query.get_pages_with_forms()

# Export data
query.export_to_json("http://example.com", "output.json")

SEO Analysis

import json

from spider.plugins.scraper_utils import ScraperDataQuery

query = ScraperDataQuery()

pages = query.get_all_pages()
for page in pages:
    # Check for SEO issues
    if not page['description']:
        print(f"⚠️ Missing description: {page['url']}")

    headings = json.loads(page['headings'])
    if len(headings['h1']) == 0:
        print(f"⚠️ No H1 heading: {page['url']}")

πŸ§ͺ Testing

# Run all tests
poetry run pytest

# Run web scraper plugin tests
poetry run python test_web_scraper_plugin.py

# Check coverage
poetry run pytest --cov=spider

πŸ“š Documentation

Detailed guides live in the docs/ directory (see docs/web-scraper/ for the web scraper plugin) and in Plugin.md for the plugin API.


🀝 Contributing

We welcome contributions! Here's how to get started:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Please read CONTRIBUTING.md for detailed guidelines.


πŸ“‹ Requirements

  • Python 3.11+
  • PostgreSQL 12+
  • Redis 6+
  • Poetry (package manager)

See pyproject.toml for complete dependencies.


πŸ“„ License

MIT License - See LICENSE for details.


πŸ™ Acknowledgments

Built with:

  • aiohttp & asyncio
  • Celery & Redis
  • PostgreSQL & SQLAlchemy
  • spaCy
  • Playwright


πŸ› Issues & Support


Made with ❀️ by Roshan Lamichhaner

⭐ Star us on GitHub!
