A data science solution for extracting structured information from M&A (Mergers & Acquisitions) document requests in Excel format, and creating organized folder structures.
This project addresses two main challenges:
- Information Extraction: Parse diverse Excel file formats containing M&A document requests and generate a structured folder hierarchy
- GDPR Compliance: Identify and redact personal data from documents to ensure regulatory compliance
- Multi-format Excel support (`.xlsx` and `.xls`)
- Automatic hidden sheet filtering
- Multiple parsing strategies for different Excel structures
- LLM-powered folder name generation using structured outputs
- Async processing with rate limiting for API calls
- GDPR-compliant text anonymization (regex + LLM approaches)
- Document extraction from PDF and DOCX files
Create a virtual environment and install dependencies, either with `uv`:

```bash
uv venv
uv pip install -r requirements.txt
```

or with the standard library:

```bash
python -m venv venv
venv\Scripts\activate  # Windows
pip install -r requirements.txt
```

Copy `.env.example` to `.env`:

```bash
copy .env.example .env
```

Add your OpenAI API key to `.env`:

```
OPENAI_API_KEY=your_api_key_here
```

Run the pipeline:

```bash
python main.py --data-folder ./data --output-dir ./output
```

Project structure:

```
information_extraction/
├── README.md
├── requirements.txt
├── .env.example
├── main.py                              # Entry point
├── information_extraction_excel.ipynb   # Jupyter notebook with full implementation
├── src/
│   ├── __init__.py
│   ├── excel_loader.py                  # Excel loading & sheet filtering
│   ├── cleaners.py                      # Data cleaning functions
│   ├── llm_processor.py                 # LLM calls & folder generation
│   ├── anonymizer.py                    # GDPR redaction functions
│   ├── document_extractor.py            # PDF/DOCX text extraction
│   └── utils.py                         # Utilities (logging, validation)
├── data/                                # Input Excel files
└── logs/                                # Processing logs
```
**`src/excel_loader.py`** — Handles loading Excel files, filtering hidden sheets, and reading visible sheets into pandas DataFrames.
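The hidden-sheet filtering step can be sketched roughly like this (illustrative only, not the repo's actual code; openpyxl exposes each worksheet's visibility via `sheet_state`):

```python
import openpyxl
import pandas as pd


def load_visible_sheets(path):
    """Return only the visible sheets of an .xlsx file as DataFrames."""
    wb = openpyxl.load_workbook(path)
    visible = [ws.title for ws in wb.worksheets if ws.sheet_state == "visible"]
    wb.close()
    # header=None keeps raw rows, since these files have irregular layouts
    return {name: pd.read_excel(path, sheet_name=name, header=None)
            for name in visible}
```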
**`src/cleaners.py`** — Contains multiple cleaning functions (`clean_excel_information_table`, `clean_excel_information_2`, etc.) to handle different Excel formats.
**`src/llm_processor.py`** — Manages LLM API calls for generating folder structures:
- Async batch processing with semaphores
- Retry logic with exponential backoff
- Structured output using Pydantic models
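The concurrency pattern above (semaphore-limited batches plus exponential-backoff retries) can be sketched roughly as follows; `coro_fn` is a stand-in for the real OpenAI structured-output call, which the repo wraps with a Pydantic model — none of this is the repo's actual code:

```python
import asyncio


async def with_retry(coro_fn, *args, retries=3, base_delay=1.0):
    """Retry an async call with exponential backoff between attempts."""
    for attempt in range(retries):
        try:
            return await coro_fn(*args)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)


async def process_batch(items, coro_fn, max_concurrency=5):
    """Run coro_fn over items with at most max_concurrency calls in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with sem:
            return await with_retry(coro_fn, item)

    # gather preserves input order, so results line up with items
    return await asyncio.gather(*(guarded(i) for i in items))
```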
**`src/anonymizer.py`** — GDPR compliance utilities:
- Regex-based anonymization for emails, phones, addresses, IDs
- LLM-based anonymization for complex cases
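A minimal sketch of the regex-based approach; the two patterns below are illustrative and far from exhaustive (real PII coverage needs many more patterns, or the LLM/NER passes discussed later):

```python
import re

# Placeholder -> pattern; order matters if patterns could overlap
PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "[PHONE]": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def anonymize(text: str) -> str:
    """Replace matched PII spans with bracketed placeholders."""
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text
```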
**`src/document_extractor.py`** — Text extraction from various document formats (PDF, DOCX).
- ~30 rows: 23-30 seconds, ~$0.03 cost
- ~120 rows: 2.5-3 minutes, ~$0.05 cost
- ~5 Excel files (all samples): ~$0.30-0.70 and ~9.5 minutes; with 3 retries, cost could reach ~$1.60
- Async processing yields roughly a 6-7x speedup over sequential execution (~3 minutes vs ~20 minutes)
- LLM-based Excel parsing: Instead of manually handling different Excel formats with custom cleaning functions, consider using LLMs to directly extract structured data from Excel files. This would reduce the need for format-specific preprocessing.
- Instruction-aware processing: Parse instructions from Excel sheets (once documents become available) and incorporate them into LLM prompts for more context-aware extraction.
- Non-LLM alternative: Sector/Area columns could be merged without LLMs to create folder hierarchies, but this approach:
  - Is case-sensitive and brittle (not robust to upper/lowercase variations)
  - Requires extensive special-character cleaning
  - Is less flexible for new formats
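The non-LLM merge described above could look like the sketch below (hypothetical helper names); the `normalize` function is exactly the kind of special-character cleaning that makes this approach brittle:

```python
import re
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents, drop unsafe characters, snake_case the rest."""
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    name = re.sub(r"[^\w\s-]", "", name).strip().lower()
    return re.sub(r"[\s_]+", "_", name)


def folder_path(sector: str, area: str = "") -> str:
    """Join Sector/Area into a folder path, skipping a redundant Area."""
    parts = [normalize(sector)]
    if area and normalize(area) != parts[0]:
        parts.append(normalize(area))
    return "/".join(parts)
```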
- Enhanced anonymization: The current regex-based approach has limitations:
  - May miss context-dependent PII
  - Can't handle complex linguistic patterns
  - Recommendation: prefer LLM-based anonymization for production use
- Validation framework: Implement "LLM as judge" to verify that fields are correctly anonymized
- NER integration: Consider using Named Entity Recognition (NER) models for more robust PII detection
- Configuration management: Add a config file (YAML/TOML) for:
  - Model selection and parameters
  - Rate limiting settings
  - Folder naming conventions
  - Anonymization rules
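Such a config file might look like the sketch below; all key names are illustrative, not an existing schema:

```yaml
# Hypothetical config sketch; key names are illustrative
llm:
  model: gpt-4o-mini
  temperature: 0.0
rate_limiting:
  max_concurrent_requests: 5
  retries: 3
folders:
  max_depth: 3
  name_case: snake
anonymization:
  use_llm: true
  regex_rules: [email, phone, address, id]
```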
- Standardization: Current implementation attempts to standardize DataFrame formats across different Excel structures, but this could be further improved with a unified schema
- Model selection: Consider using models from other providers for simple tasks
- Caching: Implement caching for similar requests to reduce redundant API calls
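One possible shape for that cache: key responses on model plus prompt content and reuse earlier results (in-memory here; a disk-backed store would persist across runs). This is a hypothetical helper, not repo code:

```python
import hashlib
import json


class ResponseCache:
    """Deduplicate LLM calls by hashing the (model, prompt) pair."""

    def __init__(self):
        self._store = {}

    def key(self, model: str, prompt: str) -> str:
        raw = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def get_or_call(self, model: str, prompt: str, call):
        k = self.key(model, prompt)
        if k not in self._store:          # only pay for unseen prompts
            self._store[k] = call(model, prompt)
        return self._store[k]
```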
- Unit tests: Add tests for all cleaner functions
- Integration tests: Add end-to-end tests with sample Excel files
- Validation tests: Automated checks for GDPR compliance effectiveness
I want to split the info in this Excel into several columns, named Section, Area, and Requested information. A Section is a row that starts with a Roman numeral followed by a name in capitals, for example "IV. ASSETS"; the section below it should be named "Assets". An Area is a row starting with a capital letter followed by a name, like "A. Real Estate"; if missing, it should default to the Section name. Requested information starts with a number followed by instructions, for example "6. Delegations of powers and powers of attorney". If the rows after such a numbered row start with lowercase letters, like "a. the country of incorporation;", they are also part of that row's requested information.
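The rules above are deterministic enough to express as regexes; the sketch below classifies one row at a time (illustrative only, not the project's parser):

```python
import re

# Row patterns for the rules above (checked in order):
SECTION = re.compile(r"^\s*[IVXLCDM]+\.\s+([A-Z][A-Z\s&-]+?)\s*$")  # "IV. ASSETS"
AREA = re.compile(r"^\s*[A-Z]\.\s+(.+)$")                           # "A. Real Estate"
REQUEST = re.compile(r"^\s*\d+\.\s+(.+)$")                          # "6. Delegations..."
SUBITEM = re.compile(r"^\s*[a-z]\.\s+(.+)$")                        # "a. the country..."


def classify_row(text: str):
    """Return (row_type, payload) for one Excel row."""
    # SECTION first: a single-letter Area like "C." also looks like a Roman
    # numeral, but only Sections have an all-caps name.
    for kind, pattern in [("section", SECTION), ("area", AREA),
                          ("request", REQUEST), ("subitem", SUBITEM)]:
        m = pattern.match(text)
        if m:
            return kind, m.group(1).strip()
    return "other", text.strip()
```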