
πŸ•·οΈ Spider

A modern, scalable, and extensible web crawler for efficient distributed crawling and data extraction

Built with asynchronous I/O, a plugin architecture, and distributed task processing


✨ Features

Core Capabilities

  • πŸš€ Asynchronous Crawling - Non-blocking I/O with aiohttp and asyncio for high performance
  • 🌐 Distributed Processing - Scale across multiple workers using Celery and Redis
  • πŸ’Ύ Database Persistence - PostgreSQL storage with SQLAlchemy ORM
  • πŸ”Œ Plugin Architecture - Extensible system for custom data processing
  • πŸ“Š Robust Logging - Console, file, and database logging for diagnostics
  • πŸ”— URL Normalization - Smart deduplication and link management (see the sketch below)
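
The README does not spell out the normalization rules, but canonicalizing URLs before they enter the frontier is what makes deduplication work. The helper below is an illustrative sketch only (the real logic lives in src/spider/utils.py and may differ):

from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    # Illustrative sketch -- not the actual utils.py implementation.
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Drop default ports and fragments; collapse an empty path to "/".
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[: -len(":80")]
    elif scheme == "https" and netloc.endswith(":443"):
        netloc = netloc[: -len(":443")]
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))

# "HTTP://Example.com:80/page#top" and "http://example.com/page" map to the same key.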

Included Plugins

  • πŸ•ΈοΈ Web Scraper - Comprehensive webpage data extraction
  • πŸ“ Title Logger - Extract and store page titles
  • πŸ€– Entity Extraction - NLP-based named entity recognition (spaCy)
  • 🎭 Dynamic Scraper - JavaScript-rendered pages (Playwright)
  • πŸ“ˆ Real-time Metrics - Live crawl statistics via WebSocket

πŸš€ Quick Start

Prerequisites

  • Python 3.11+
  • PostgreSQL
  • Redis

Installation

# Clone repository
git clone https://github.com/roshanlam/spider.git
cd spider

# Install with Poetry (recommended)
poetry install

# Install Playwright browsers
poetry run playwright install chromium

# Download spaCy model
poetry run python -m spacy download en_core_web_sm

Configuration

Edit src/spider/config.yaml:

start_url: "http://example.com"
rate_limit: 1  # seconds between requests
threads: 8
timeout: 10

database:
  url: "postgresql://username@localhost/crawlerdb"

celery:
  broker_url: "redis://localhost:6379/0"
  result_backend: "redis://localhost:6379/0"
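
These settings surface in Python as a plain dictionary exposed by spider.config (the examples below read config['start_url']). A minimal sketch of what the loader in src/spider/config.py roughly does, assuming PyYAML:

import yaml

# Parse the YAML file into a nested dict (sketch; the real loader may add defaults or validation).
with open("src/spider/config.yaml") as f:
    config = yaml.safe_load(f)

print(config["start_url"])             # "http://example.com"
print(config["rate_limit"])            # 1
print(config["database"]["url"])       # "postgresql://username@localhost/crawlerdb"
print(config["celery"]["broker_url"])  # "redis://localhost:6379/0"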

Run the Crawler

Simple way:

poetry run python run.py

Or using module:

poetry run python -m spider.main

Query Scraped Data

# View all scraped data
poetry run python query_data.py

# Or programmatically
poetry run python
>>> from spider.plugins.scraper_utils import ScraperDataQuery
>>> query = ScraperDataQuery()
>>> page = query.get_page_data("http://example.com")
>>> print(page['title'])

πŸ“¦ Project Structure

spider/
β”œβ”€β”€ src/spider/
β”‚   β”œβ”€β”€ spider.py           # Core async crawler
β”‚   β”œβ”€β”€ plugin.py           # Plugin system
β”‚   β”œβ”€β”€ storage.py          # Database persistence
β”‚   β”œβ”€β”€ link_finder.py      # HTML parsing and link extraction
β”‚   β”œβ”€β”€ tasks.py            # Celery distributed tasks
β”‚   β”œβ”€β”€ config.py           # Configuration loader
β”‚   β”œβ”€β”€ utils.py            # URL normalization and utilities
β”‚   └── plugins/
β”‚       β”œβ”€β”€ web_scraper_plugin.py      # Comprehensive web scraper
β”‚       β”œβ”€β”€ scraper_utils.py           # Query utilities
β”‚       β”œβ”€β”€ title_logger_plugin.py     # Title extraction
β”‚       β”œβ”€β”€ entity_extraction.py       # NLP entity extraction
β”‚       β”œβ”€β”€ dynamic_scraper.py         # JavaScript rendering
β”‚       └── real_time_metrics.py       # Live metrics
β”œβ”€β”€ docs/                   # Documentation
β”œβ”€β”€ examples/               # Usage examples
β”œβ”€β”€ tests/                  # Test suite
β”œβ”€β”€ run.py                  # Simple runner script
β”œβ”€β”€ query_data.py           # Data query script
└── pyproject.toml          # Poetry dependencies

πŸ”Œ Plugin System

Spider uses a powerful plugin architecture for extensibility.

Using the Web Scraper Plugin

The comprehensive web scraper extracts structured data from every page:

from spider.plugins.scraper_utils import ScraperDataQuery

query = ScraperDataQuery()

# Get page data
page = query.get_page_data("http://example.com")
print(f"Title: {page['title']}")
print(f"Words: {page['word_count']}")
print(f"Links: {len(page['links'])}")

# Search pages
results = query.search_by_title("python")

# Get statistics
stats = query.get_page_statistics()
print(f"Total pages: {stats['total_pages']}")

What gets extracted:

  • Metadata (title, description, keywords, author, language)
  • Content structure (headings, word count, text analysis)
  • Links (internal/external with anchor text)
  • Images (URLs, alt text, dimensions)
  • Forms (actions, methods, input fields)
  • Social metadata (OpenGraph, Twitter Card)
  • Structured data (JSON-LD)
  • Page structure (semantic HTML)
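
Taken together, a record returned by get_page_data() looks roughly like the dictionary below. Only the keys exercised in the examples on this page (title, description, word_count, links, headings) are confirmed; the remaining field names and shapes are assumptions for illustration:

page = {
    "url": "http://example.com",
    "title": "Example Domain",
    "description": "This domain is for use in illustrative examples.",
    "word_count": 128,
    "links": [  # the shape of each link entry is assumed
        {"href": "https://www.iana.org/domains/example", "text": "More information...", "internal": False},
    ],
    "headings": '{"h1": ["Example Domain"], "h2": [], "h3": []}',  # stored as a JSON string
    # ...plus images, forms, OpenGraph/Twitter Card metadata, and JSON-LD, per the list above
}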

πŸ“š Full documentation: docs/web-scraper/

Creating Custom Plugins

from spider.plugin import Plugin

class MyPlugin(Plugin):
    async def should_run(self, url: str, content: str) -> bool:
        return True  # Run on all pages

    async def process(self, url: str, content: str) -> str:
        # Your processing logic here
        print(f"Processing {url}")
        return content

# Register in main.py
plugin_manager.register(MyPlugin())
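
As a slightly more concrete illustration built on the same interface (this plugin is hypothetical, not shipped with Spider), here is one that flags pages with no <title> tag:

from spider.plugin import Plugin

class MissingTitlePlugin(Plugin):
    """Hypothetical example: warn when a crawled page lacks a <title> tag."""

    async def should_run(self, url: str, content: str) -> bool:
        # Only inspect responses that look like HTML.
        return "<html" in content.lower()

    async def process(self, url: str, content: str) -> str:
        if "<title" not in content.lower():
            print(f"⚠️ No <title> tag on {url}")
        return content  # pass the content through unchanged for the next plugin

plugin_manager.register(MissingTitlePlugin())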

πŸ“š Plugin documentation: Plugin.md


🌐 Distributed Mode

Run Spider across multiple workers for large-scale crawling.

Start Celery Worker

celery -A spider.tasks.celery_app worker --loglevel=info
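
To scale out, start the same worker command on additional machines pointed at the same Redis broker; Celery's standard --concurrency option caps the number of worker processes per node:

# Example: another worker with four processes (hostnames and concurrency are up to you)
celery -A spider.tasks.celery_app worker --loglevel=info --concurrency=4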

Queue Tasks

from spider.tasks import crawl_task

result = crawl_task.delay("https://example.com")
print(f"Task ID: {result.id}")

πŸ“Š Usage Examples

Basic Crawling

import asyncio
from spider.spider import Spider
from spider.config import config
from spider.plugin import PluginManager
from spider.plugins.web_scraper_plugin import WebScraperPlugin

# Setup
plugin_manager = PluginManager()
plugin_manager.register(WebScraperPlugin())

# Create and run crawler
crawler = Spider(config['start_url'], config, plugin_manager)
asyncio.run(crawler.crawl())

Query Data

from spider.plugins.scraper_utils import ScraperDataQuery

query = ScraperDataQuery()

# Get all pages
pages = query.get_all_pages(limit=10)
for page in pages:
    print(f"{page['url']}: {page['title']}")

# Find pages with forms
pages_with_forms = query.get_pages_with_forms()

# Export data
query.export_to_json("http://example.com", "output.json")

SEO Analysis

import json

from spider.plugins.scraper_utils import ScraperDataQuery

query = ScraperDataQuery()

pages = query.get_all_pages()
for page in pages:
    # Check for SEO issues
    if not page['description']:
        print(f"⚠️ Missing description: {page['url']}")

    headings = json.loads(page['headings'])
    if len(headings['h1']) == 0:
        print(f"⚠️ No H1 heading: {page['url']}")

πŸ§ͺ Testing

# Run all tests
poetry run pytest

# Run web scraper plugin tests
poetry run python test_web_scraper_plugin.py

# Check coverage
poetry run pytest --cov=spider

πŸ“š Documentation

Detailed guides live in the docs/ directory (see docs/web-scraper/ for the web scraper plugin) and in Plugin.md for the plugin API.


🀝 Contributing

We welcome contributions! Here's how to get started:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Please read CONTRIBUTING.md for detailed guidelines.


πŸ“‹ Requirements

  • Python 3.11+
  • PostgreSQL 12+
  • Redis 6+
  • Poetry (package manager)

See pyproject.toml for complete dependencies.


πŸ“„ License

MIT License - See LICENSE for details.


πŸ™ Acknowledgments

Built with:

  • aiohttp & asyncio
  • Celery & Redis
  • PostgreSQL & SQLAlchemy
  • spaCy
  • Playwright


πŸ› Issues & Support


Made with ❀️ by Roshan Lamichhaner

⭐ Star us on GitHub!
