This project performs sentiment analysis on public opinions about OpenAI shared on LinkedIn. Using natural language processing (NLP) techniques, the system analyses posts and comments and classifies each one as positive, neutral, or negative. It was developed as part of an engineering thesis and is intended for educational purposes.
- Data collection from LinkedIn using an Apify scraper
- Robust text preprocessing to handle URLs, slang, and multilingual content
- Manual labelling functionality for creating training datasets
- Transfer learning with a RoBERTa model for sentiment classification
- Cross-validation to evaluate model performance
- Data visualisation with histograms and word clouds
The project follows a complete machine learning pipeline:
- Data Collection: Scraping LinkedIn posts and comments containing "openai"
- Preprocessing: Text cleaning, language detection, URL tokenisation
- Labelling: Manual sentiment annotation interface
- Model Training: Fine-tuning RoBERTa with Transfer Learning
- Evaluation: Using cross-validation and accuracy metrics
- Visualisation: Displaying sentiment distributions and word frequencies
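The evaluation step of the pipeline above can be illustrated with a minimal, standard-library sketch of stratified k-fold cross-validation with accuracy scoring. The function names, the fold count, and the `train_and_predict` callback are assumptions made for this example and are not taken from the project code; in practice the callback would fine-tune the model on the training fold and return predictions for the held-out fold.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate(texts, labels, train_and_predict, k=5):
    """Return one accuracy score per held-out fold.

    train_and_predict(train_texts, train_labels, test_texts) is a
    hypothetical callback that must return one label per test text.
    """
    scores = []
    for held_out in stratified_folds(labels, k):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        predictions = train_and_predict(
            [texts[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [texts[i] for i in held_out],
        )
        scores.append(accuracy([labels[i] for i in held_out], predictions))
    return scores
```

The sketch assumes there are at least `k` samples, so that no fold is empty. Stratifying by label keeps minority sentiment classes represented in every fold, which matters when the classes are unevenly distributed.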
Data processing includes:
- English language detection using Lingua
- URL standardisation and removal of duplicates
- Sentiment categorisation (Positive, Neutral, Negative)
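The processing steps listed above could look roughly like the sketch below. It is an illustration only: the `<URL>` placeholder, the regular expression, and the function names are assumptions, and the language check is passed in as a callable so the example runs without extra dependencies.

```python
import re

# Assumed pattern and placeholder token; the project may use different ones
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
URL_TOKEN = "<URL>"


def normalise(text):
    """Replace URLs with a single token and collapse runs of whitespace."""
    text = URL_PATTERN.sub(URL_TOKEN, text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(posts, is_english=lambda text: True):
    """Normalise posts, drop empty and non-English ones, remove duplicates.

    is_english is a hypothetical hook for a language detector. With the
    lingua-language-detector package it could wrap something like
    detector.detect_language_of(text) == Language.ENGLISH.
    """
    seen = set()
    cleaned = []
    for post in posts:
        text = normalise(post)
        key = text.casefold()  # compare duplicates case-insensitively
        if not text or key in seen or not is_english(text):
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned
```

Standardising URLs before deduplication means that two posts differing only in the link they share are treated as the same text.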
The sentiment analysis uses:
- Base model: RoBERTa
- Transfer Learning: Fine-tuning on domain-specific data
- Layer freezing: 10 bottom layers frozen to preserve general language understanding
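A minimal sketch of the layer freezing described above, using the Hugging Face Transformers model layout. The `roberta-base` checkpoint (12 encoder layers), the three-label head, and the decision to freeze the embeddings along with the encoder layers are assumptions for illustration; the project may configure these differently.

```python
def freeze_bottom_layers(model, n_frozen=10, freeze_embeddings=True):
    """Disable gradients for the lowest n_frozen encoder layers.

    Expects a RoBERTa sequence-classification model, which exposes
    model.roberta.embeddings and model.roberta.encoder.layer.
    """
    if freeze_embeddings:  # assumption: embeddings are frozen as well
        for param in model.roberta.embeddings.parameters():
            param.requires_grad = False
    for layer in model.roberta.encoder.layer[:n_frozen]:
        for param in layer.parameters():
            param.requires_grad = False
    return model


def build_classifier(checkpoint="roberta-base", num_labels=3):
    """Load a pre-trained checkpoint and freeze its bottom layers."""
    # Imported here so the helper above has no hard dependency on Transformers
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    return freeze_bottom_layers(model)
```

With 10 of 12 encoder layers frozen, only the top two layers and the classification head receive gradient updates, which keeps fine-tuning cheap and limits overfitting on a small labelled dataset.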
The project explores several key data science and NLP concepts:
- Web Scraping: Ethical data collection techniques from social media platforms
- Natural Language Processing: Text preprocessing, tokenisation, and sentiment analysis
- Transfer Learning: Adapting pre-trained language models to new domains
- Feature Engineering: Extracting relevant features from text data
- Model Evaluation: Cross-validation techniques and performance metrics
- Class Imbalance: Handling uneven distribution of sentiment classes
- Hyperparameter Tuning: Optimising model parameters for better performance
- Data Visualisation: Techniques for presenting text data analysis results
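One common way to handle the class imbalance mentioned above is to weight each class inversely to its frequency. The sketch below shows that heuristic; it is not necessarily the approach used in this project, and the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }
```

Rare classes receive weights above 1 and frequent classes weights below 1. Such weights could then be passed to a weighted loss, for example the `weight` argument of `torch.nn.CrossEntropyLoss`, in the order of the model's label indices.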
Requirements:
- Python 3.10
- PyTorch
- Transformers (Hugging Face)
- Apify client
- Lingua
Install the dependencies with: pip install -r requirements.txt
Additionally, the project requires a PyTorch build that is compatible with the NVIDIA CUDA version available on your machine, which can differ across devices. Check it with: nvidia-smi (the reported CUDA version is the highest one the installed driver supports), then install a matching PyTorch package.
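After installation, a short check along the following lines can confirm whether the installed PyTorch build can see the GPU. This is a sketch, and the function name is hypothetical.

```python
import importlib.util


def describe_torch_cuda():
    """Summarise the installed PyTorch build and its CUDA support."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"

    import torch

    if torch.version.cuda is None:
        # torch.version.cuda is None when the build has no CUDA support
        return f"PyTorch {torch.__version__} was not built with CUDA support"
    if not torch.cuda.is_available():
        return (
            f"PyTorch {torch.__version__} was built for CUDA "
            f"{torch.version.cuda}, but no usable GPU was found"
        )
    return (
        f"PyTorch {torch.__version__} with CUDA {torch.version.cuda} "
        f"on {torch.cuda.get_device_name(0)}"
    )


print(describe_torch_cuda())
```

If the output reports a build without CUDA support or no usable GPU, reinstalling PyTorch with a build that matches the driver's CUDA version is usually the next step.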