A data science solution for extracting structured information from M&A (Mergers & Acquisitions) document requests in Excel format, and creating organized folder structures.
This project addresses two main challenges:
- Information Extraction: Parse diverse Excel file formats containing M&A document requests and generate a structured folder hierarchy
- GDPR Compliance: Identify and redact personal data from documents to ensure regulatory compliance
- Multi-format Excel support (`.xlsx` and `.xls`)
- Automatic hidden sheet filtering
- Multiple parsing strategies for different Excel structures
- LLM-powered folder name generation using structured outputs
- Async processing with rate limiting for API calls
- GDPR-compliant text anonymization (regex + LLM approaches)
- Document extraction from PDF and DOCX files
Create a virtual environment and install dependencies, either with `uv`:

```bash
uv venv
uv pip install -r requirements.txt
```

or with the standard library:

```bash
python -m venv venv
venv\Scripts\activate  # Windows
pip install -r requirements.txt
```

Copy `.env.example` to `.env`:

```bash
copy .env.example .env
```

Add your OpenAI API key to `.env`:

```
OPENAI_API_KEY=your_api_key_here
```

Run the pipeline:

```bash
python main.py --data-folder ./data --output-dir ./output
```

Project structure:

```
information_extraction/
├── README.md
├── requirements.txt
├── .env.example
├── main.py                              # Entry point
├── information_extraction_excel.ipynb   # Jupyter notebook with full implementation
├── src/
│   ├── __init__.py
│   ├── excel_loader.py                  # Excel loading & sheet filtering
│   ├── cleaners.py                      # Data cleaning functions
│   ├── llm_processor.py                 # LLM calls & folder generation
│   ├── anonymizer.py                    # GDPR redaction functions
│   ├── document_extractor.py            # PDF/DOCX text extraction
│   └── utils.py                         # Utilities (logging, validation)
├── data/                                # Input Excel files
└── logs/                                # Processing logs
```
**`src/excel_loader.py`** — Handles loading Excel files, filtering hidden sheets, and reading visible sheets into pandas DataFrames.
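The hidden-sheet filtering step can be sketched roughly like this (illustrative only, not the repo's actual code; openpyxl exposes each worksheet's visibility via `sheet_state`):

```python
import openpyxl
import pandas as pd


def load_visible_sheets(path):
    """Return only the visible sheets of an .xlsx file as DataFrames."""
    wb = openpyxl.load_workbook(path)
    visible = [ws.title for ws in wb.worksheets if ws.sheet_state == "visible"]
    wb.close()
    # header=None keeps raw rows, since these files have irregular layouts
    return {name: pd.read_excel(path, sheet_name=name, header=None)
            for name in visible}
```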
**`src/cleaners.py`** — Contains multiple cleaning functions (`clean_excel_information_table`, `clean_excel_information_2`, etc.) to handle different Excel formats.
**`src/llm_processor.py`** — Manages LLM API calls for generating folder structures:
- Async batch processing with semaphores
- Retry logic with exponential backoff
- Structured output using Pydantic models
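The concurrency pattern above (semaphore-limited batches plus exponential-backoff retries) can be sketched roughly as follows; `coro_fn` is a stand-in for the real OpenAI structured-output call, which the repo wraps with a Pydantic model — none of this is the repo's actual code:

```python
import asyncio


async def with_retry(coro_fn, *args, retries=3, base_delay=1.0):
    """Retry an async call with exponential backoff between attempts."""
    for attempt in range(retries):
        try:
            return await coro_fn(*args)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)


async def process_batch(items, coro_fn, max_concurrency=5):
    """Run coro_fn over items with at most max_concurrency calls in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with sem:
            return await with_retry(coro_fn, item)

    # gather preserves input order, so results line up with items
    return await asyncio.gather(*(guarded(i) for i in items))
```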
**`src/anonymizer.py`** — GDPR compliance utilities:
- Regex-based anonymization for emails, phones, addresses, IDs
- LLM-based anonymization for complex cases
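A minimal sketch of the regex-based approach; the two patterns below are illustrative and far from exhaustive (real PII coverage needs many more patterns, or the LLM/NER passes discussed later):

```python
import re

# Placeholder -> pattern; order matters if patterns could overlap
PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "[PHONE]": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def anonymize(text: str) -> str:
    """Replace matched PII spans with bracketed placeholders."""
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text
```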
**`src/document_extractor.py`** — Text extraction from various document formats (PDF, DOCX).
- ~30 rows: 23-30 seconds, ~$0.03 cost
- ~120 rows: 2.5-3 minutes, ~$0.05 cost
- ~5 Excel files (all samples): ~$0.30-0.70 and ~9.5 minutes; with 3 retries, cost could reach ~$1.60
- Async processing yields roughly a 6-7x speedup over sequential execution (~3 minutes vs ~20 minutes)
- LLM-based Excel parsing: Instead of manually handling different Excel formats with custom cleaning functions, consider using LLMs to directly extract structured data from Excel files. This would reduce the need for format-specific preprocessing.
- Instruction-aware processing: Parse instructions from Excel sheets (once documents become available) and incorporate them into LLM prompts for more context-aware extraction.
- Non-LLM alternative: Sector/Area columns could be merged without LLMs to create folder hierarchies, but this approach:
  - Is case-sensitive and brittle (not robust to upper/lowercase variations)
  - Requires extensive special-character cleaning
  - Is less flexible for new formats
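The non-LLM merge described above could look like the sketch below (hypothetical helper names); the `normalize` function is exactly the kind of special-character cleaning that makes this approach brittle:

```python
import re
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents, drop unsafe characters, snake_case the rest."""
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    name = re.sub(r"[^\w\s-]", "", name).strip().lower()
    return re.sub(r"[\s_]+", "_", name)


def folder_path(sector: str, area: str = "") -> str:
    """Join Sector/Area into a folder path, skipping a redundant Area."""
    parts = [normalize(sector)]
    if area and normalize(area) != parts[0]:
        parts.append(normalize(area))
    return "/".join(parts)
```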
- Enhanced anonymization: The current regex-based approach has limitations:
  - May miss context-dependent PII
  - Can't handle complex linguistic patterns
  - Recommendation: prefer LLM-based anonymization for production use
- Validation framework: Implement "LLM as judge" to verify that fields are correctly anonymized
- NER integration: Consider using Named Entity Recognition (NER) models for more robust PII detection
- Configuration management: Add a config file (YAML/TOML) for:
  - Model selection and parameters
  - Rate limiting settings
  - Folder naming conventions
  - Anonymization rules
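Such a config file might look like the sketch below; all key names are illustrative, not an existing schema:

```yaml
# Hypothetical config sketch; key names are illustrative
llm:
  model: gpt-4o-mini
  temperature: 0.0
rate_limiting:
  max_concurrent_requests: 5
  retries: 3
folders:
  max_depth: 3
  name_case: snake
anonymization:
  use_llm: true
  regex_rules: [email, phone, address, id]
```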
- Standardization: Current implementation attempts to standardize DataFrame formats across different Excel structures, but this could be further improved with a unified schema
- Model selection: Consider using models from other providers for simple tasks
- Caching: Implement caching for similar requests to reduce redundant API calls
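One possible shape for that cache: key responses on model plus prompt content and reuse earlier results (in-memory here; a disk-backed store would persist across runs). This is a hypothetical helper, not repo code:

```python
import hashlib
import json


class ResponseCache:
    """Deduplicate LLM calls by hashing the (model, prompt) pair."""

    def __init__(self):
        self._store = {}

    def key(self, model: str, prompt: str) -> str:
        raw = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def get_or_call(self, model: str, prompt: str, call):
        k = self.key(model, prompt)
        if k not in self._store:          # only pay for unseen prompts
            self._store[k] = call(model, prompt)
        return self._store[k]
```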
- Unit tests: Add tests for all cleaner functions
- Integration tests: Add end-to-end tests with sample Excel files
- Validation tests: Automated checks for GDPR compliance effectiveness
I want to split the info in this Excel into several columns, named Section, Area, and Requested information. A Section is a row that starts with a Roman numeral followed by a name in capitals, for example "IV. ASSETS"; the section below it should be named "Assets". An Area is a row starting with a capital letter followed by a name, like "A. Real Estate"; if missing, it should default to the Section name. Requested information starts with a number followed by instructions, for example "6. Delegations of powers and powers of attorney". If the rows after such a numbered row start with lowercase letters, like "a. the country of incorporation;", they are also part of that row's requested information.
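The rules above are deterministic enough to express as regexes; the sketch below classifies one row at a time (illustrative only, not the project's parser):

```python
import re

# Row patterns for the rules above (checked in order):
SECTION = re.compile(r"^\s*[IVXLCDM]+\.\s+([A-Z][A-Z\s&-]+?)\s*$")  # "IV. ASSETS"
AREA = re.compile(r"^\s*[A-Z]\.\s+(.+)$")                           # "A. Real Estate"
REQUEST = re.compile(r"^\s*\d+\.\s+(.+)$")                          # "6. Delegations..."
SUBITEM = re.compile(r"^\s*[a-z]\.\s+(.+)$")                        # "a. the country..."


def classify_row(text: str):
    """Return (row_type, payload) for one Excel row."""
    # SECTION first: a single-letter Area like "C." also looks like a Roman
    # numeral, but only Sections have an all-caps name.
    for kind, pattern in [("section", SECTION), ("area", AREA),
                          ("request", REQUEST), ("subitem", SUBITEM)]:
        m = pattern.match(text)
        if m:
            return kind, m.group(1).strip()
    return "other", text.strip()
```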