A modern, scalable, and extensible web crawler for efficient distributed crawling and data extraction
Built with asynchronous I/O, a plugin architecture, and distributed task processing
- Asynchronous Crawling - Non-blocking I/O with `aiohttp` and `asyncio` for high performance
- Distributed Processing - Scale across multiple workers using Celery and Redis
- Database Persistence - PostgreSQL storage with SQLAlchemy ORM
- Plugin Architecture - Extensible system for custom data processing
- Robust Logging - Console, file, and database logging for diagnostics
- URL Normalization - Smart deduplication and link management (see the sketch after this list)
- Web Scraper - Comprehensive webpage data extraction
- Title Logger - Extract and store page titles
- Entity Extraction - NLP-based named entity recognition (spaCy)
- Dynamic Scraper - JavaScript-rendered pages (Playwright)
- Real-time Metrics - Live crawl statistics via WebSocket
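To illustrate the deduplication idea, here is a minimal sketch of URL normalization. The real logic lives in `src/spider/utils.py`; the helper below, its name, and its exact rules are hypothetical.

```python
from urllib.parse import urljoin, urldefrag, urlparse, urlunparse

def normalize_url(link: str, base: str) -> str:
    """Hypothetical sketch: resolve a link and canonicalize it for deduplication."""
    absolute = urljoin(base, link)        # resolve relative links against the page URL
    absolute, _ = urldefrag(absolute)     # drop the #fragment; it never changes the resource
    parts = urlparse(absolute)
    scheme = parts.scheme.lower()
    netloc = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:  # keep only non-default ports
        netloc += f":{parts.port}"
    return urlunparse((scheme, netloc, parts.path or "/", parts.params, parts.query, ""))

print(normalize_url("/About#team", "HTTP://Example.com:80/index.html"))
# -> http://example.com/About
```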
To run Spider you'll need:

- Python 3.11+
- PostgreSQL
- Redis
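If you don't already have PostgreSQL and Redis running locally, one quick way to start them for development (not part of the project's own docs; the container names are arbitrary, and trust auth matches the passwordless database URL in the example config below) is Docker:

```bash
# Local development only: passwordless Postgres (trust auth) plus Redis
docker run -d --name spider-postgres -p 5432:5432 \
  -e POSTGRES_USER=username -e POSTGRES_DB=crawlerdb \
  -e POSTGRES_HOST_AUTH_METHOD=trust \
  postgres:16
docker run -d --name spider-redis -p 6379:6379 redis:7
```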
```bash
# Clone repository
git clone https://github.com/roshanlam/spider.git
cd spider
# Install with Poetry (recommended)
poetry install
# Install Playwright browsers
poetry run playwright install chromium
# Download spaCy model
poetry run python -m spacy download en_core_web_sm
```

Edit `src/spider/config.yaml`:

```yaml
start_url: "http://example.com"
rate_limit: 1 # seconds between requests
threads: 8
timeout: 10
database:
  url: "postgresql://username@localhost/crawlerdb"
celery:
  broker_url: "redis://localhost:6379/0"
  result_backend: "redis://localhost:6379/0"
```
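These values are loaded by `spider.config` and exposed as the `config` mapping used in the programmatic example further down. A quick sanity check, assuming the other top-level keys are accessible the same way as `start_url`:

```python
from spider.config import config

print(config["start_url"])   # key shown in the README's own examples
print(config["rate_limit"])  # assumed to be exposed like any other top-level key
```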
Simple way:

```bash
poetry run python run.py
```

Or using module:
```bash
poetry run python -m spider.main
```

```bash
# View all scraped data
poetry run python query_data.py
# Or programmatically
poetry run python
>>> from spider.plugins.scraper_utils import ScraperDataQuery
>>> query = ScraperDataQuery()
>>> page = query.get_page_data("http://example.com")
>>> print(page['title'])
```

```
spider/
├── src/spider/
│   ├── spider.py              # Core async crawler
│   ├── plugin.py              # Plugin system
│   ├── storage.py             # Database persistence
│   ├── link_finder.py         # HTML parsing and link extraction
│   ├── tasks.py               # Celery distributed tasks
│   ├── config.py              # Configuration loader
│   ├── utils.py               # URL normalization and utilities
│   └── plugins/
│       ├── web_scraper_plugin.py   # Comprehensive web scraper
│       ├── scraper_utils.py        # Query utilities
│       ├── title_logger_plugin.py  # Title extraction
│       ├── entity_extraction.py    # NLP entity extraction
│       ├── dynamic_scraper.py      # JavaScript rendering
│       └── real_time_metrics.py    # Live metrics
├── docs/            # Documentation
├── examples/        # Usage examples
├── tests/           # Test suite
├── run.py           # Simple runner script
├── query_data.py    # Data query script
└── pyproject.toml   # Poetry dependencies
```
Spider uses a powerful plugin architecture for extensibility.
The comprehensive web scraper extracts structured data from every page:
```python
from spider.plugins.scraper_utils import ScraperDataQuery
query = ScraperDataQuery()
# Get page data
page = query.get_page_data("http://example.com")
print(f"Title: {page['title']}")
print(f"Words: {page['word_count']}")
print(f"Links: {len(page['links'])}")
# Search pages
results = query.search_by_title("python")
# Get statistics
stats = query.get_page_statistics()
print(f"Total pages: {stats['total_pages']}")What gets extracted:
- Metadata (title, description, keywords, author, language)
- Content structure (headings, word count, text analysis)
- Links (internal/external with anchor text)
- Images (URLs, alt text, dimensions)
- Forms (actions, methods, input fields)
- Social metadata (OpenGraph, Twitter Card)
- Structured data (JSON-LD)
- Page structure (semantic HTML)
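For example, the stored link data (internal/external plus anchor text) can be inspected through the same query interface. The per-link keys below (`url`, `text`, `is_internal`) and the JSON-decoding step are assumptions for illustration; see docs/web-scraper/ for the actual schema.

```python
import json
from spider.plugins.scraper_utils import ScraperDataQuery

query = ScraperDataQuery()
page = query.get_page_data("http://example.com")

# 'links' may be stored as a JSON string (like 'headings' in the SEO example
# further down) or as an already-decoded list -- handle both.
links = page['links']
if isinstance(links, str):
    links = json.loads(links)

for link in links[:10]:
    kind = "internal" if link.get('is_internal') else "external"
    print(f"[{kind}] {link.get('url')} - {link.get('text')}")
```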
Full documentation: docs/web-scraper/
To create your own plugin, subclass `Plugin`:

```python
from spider.plugin import Plugin

class MyPlugin(Plugin):
    async def should_run(self, url: str, content: str) -> bool:
        return True  # Run on all pages

    async def process(self, url: str, content: str) -> str:
        # Your processing logic here
        print(f"Processing {url}")
        return content

# Register in main.py
plugin_manager.register(MyPlugin())
```

Plugin documentation: Plugin.md
Run Spider across multiple workers for large-scale crawling.
Start a worker:

```bash
celery -A spider.tasks.celery_app worker --loglevel=info
```

Then submit crawl tasks from Python:

```python
from spider.tasks import crawl_task
result = crawl_task.delay("https://example.com")
print(f"Task ID: {result.id}")import asyncio
Or run the crawler directly in-process:

```python
import asyncio
from spider.spider import Spider
from spider.config import config
from spider.plugin import PluginManager
from spider.plugins.web_scraper_plugin import WebScraperPlugin
# Setup
plugin_manager = PluginManager()
plugin_manager.register(WebScraperPlugin())
# Create and run crawler
crawler = Spider(config['start_url'], config, plugin_manager)
asyncio.run(crawler.crawl())
```

Query the scraped data afterwards:

```python
from spider.plugins.scraper_utils import ScraperDataQuery
query = ScraperDataQuery()
# Get all pages
pages = query.get_all_pages(limit=10)
for page in pages:
print(f"{page['url']}: {page['title']}")
# Find pages with forms
pages_with_forms = query.get_pages_with_forms()
# Export data
query.export_to_json("http://example.com", "output.json")
```

Analyze pages for common SEO issues:

```python
import json
pages = query.get_all_pages()

for page in pages:
    # Check for SEO issues
    if not page['description']:
        print(f"⚠️ Missing description: {page['url']}")

    headings = json.loads(page['headings'])
    if len(headings['h1']) == 0:
        print(f"⚠️ No H1 heading: {page['url']}")
```

```bash
# Run all tests
poetry run pytest
# Run web scraper plugin tests
poetry run python test_web_scraper_plugin.py
# Check coverage
poetry run pytest --cov=spider
```

- Web Scraper Plugin - Complete plugin documentation
- Quick Start Guide - Get started in 3 steps
- Quick Reference - Command cheat sheet
- Plugin System - Creating custom plugins
- Examples - Code examples and use cases
We welcome contributions! Here's how to get started:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Please read CONTRIBUTING.md for detailed guidelines.
- Python 3.11+
- PostgreSQL 12+
- Redis 6+
- Poetry (package manager)
See pyproject.toml for complete dependencies.
MIT License - See LICENSE for details.
Built with:
- aiohttp - Async HTTP client/server
- BeautifulSoup - HTML parsing
- Celery - Distributed task queue
- SQLAlchemy - SQL toolkit and ORM
- Playwright - Browser automation
- spaCy - NLP library
- Bug Reports: GitHub Issues
- Questions: GitHub Discussions
- Documentation: docs/
Made with ❤️ by Roshan Lamichhaner
