Case Study: Logistics Document Processing Automation for an FMCG Company - File to order

The Challenge

A European FMCG company operating in the Polish market spent a significant amount of employee time manually rewriting logistics documents into the formats required by their 3PL operator — Raben. The process involved two key workflows:

Outbound orders — the company receives purchase orders from its trading partners. Every partner sends orders in a completely different format: some as Excel spreadsheets, others as PDFs, and one as text files. Each order had to be manually retyped into Raben's standardized 48-column template before shipping could be scheduled.

Inbound deliveries — delivery advices from the supplier arrived as bilingual PDF documents with a complex nested structure: product blocks containing multiple batches, each with its own expiration date and number of cartons. This data had to be processed into Raben's 21-column template.

Key issues

25+ different input formats — each trading partner uses their own document layout, column naming, product identifiers, and date formats.
Manual SKU translation — client product codes had to be looked up in an internal matrix to find the correct warehouse SKU codes.
Multi-warehouse address resolution — over half of the partners deliver to multiple distribution centers; the correct warehouse ID had to be manually identified for each order.
Quantity conversion — some partners report quantities in cartons, others in units, which required manual multiplication for every single product.
High error rate — manually retyping hundreds of product lines a day frequently led to mistakes.
Time pressure — logistics deadlines required same-day order processing.

The Solution

I built a Python desktop application that fully automates both processes. The tool runs exclusively on the client's local computer — no data is sent to external services or the cloud.

How it works

The user drops the source files into a designated folder, clicks a single button, and within seconds receives a ready-to-use XLSX file fully compliant with the Raben template. Processed files are automatically archived by date.
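The date-based archiving mentioned above can be pictured with a minimal sketch. The folder layout (one subfolder per ISO date), the collision suffix, and the function name archive_file are assumptions made for illustration; the text above only states that processed files are archived by date.

```python
import shutil
from datetime import date
from pathlib import Path


def archive_file(source: Path, archive_root: Path, day: date) -> Path:
    """Move a processed file into archive_root/YYYY-MM-DD/ and return its new path.

    Hypothetical helper: the real tool's folder naming is not documented here.
    """
    target_dir = archive_root / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)

    # Never overwrite an earlier file with the same name; add a numeric suffix.
    target = target_dir / source.name
    counter = 1
    while target.exists():
        target = target_dir / f"{source.stem}_{counter}{source.suffix}"
        counter += 1

    shutil.move(str(source), str(target))
    return target
```

Passing the date in explicitly, instead of calling date.today() inside the function, keeps the helper easy to test and lets a late-night batch be filed under the intended day.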

Internal application logic:

Source identification — analyzing the file structure, content signatures, and format to determine which trading partner (or supplier) the document came from.
Data extraction — a dedicated parser for each format reads the relevant fields using pattern matching, table extraction, and state machine logic.
Product code resolution — translating client-specific product identifiers into internal warehouse SKUs based on a master data matrix.
Delivery address resolution — for multi-warehouse partners, the system determines the correct Raben warehouse ID based on the delivery address using a multi-level fuzzy matching engine (postal codes, city names, warehouse codes, normalized text matching).
Output file generation — writing a complete, ready-to-upload XLSX file populated with all constant values, sequential order numbers, and converted quantities.
Archiving — moving processed files into date-stamped folders; files containing errors are sent to a verification queue.
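The first steps of this pipeline (source identification, product code resolution, quantity conversion) can be sketched as follows. This is an illustrative sketch, not the production code: the partner names, signature strings, SKU values, and packaging ratios are all invented, and real identification also looks at file structure and format, which is omitted here.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical content signatures: every marker must appear in the document text.
SIGNATURES: dict[str, tuple[str, ...]] = {
    "PARTNER_A": ("ZAMÓWIENIE", "Partner A Sp. z o.o."),
    "PARTNER_B": ("PURCHASE ORDER", "Partner B GmbH"),
}

# Hypothetical master data matrix:
# (partner, client product code) -> (warehouse SKU, units per carton)
SKU_MATRIX: dict[tuple[str, str], tuple[str, int]] = {
    ("PARTNER_A", "A-1001"): ("SKU-000123", 12),
    ("PARTNER_B", "5900000000017"): ("SKU-000123", 12),
}


@dataclass(frozen=True)
class OrderLine:
    client_code: str
    quantity: int
    unit: str  # "carton" or "unit"


def identify_source(text: str) -> str | None:
    """Return the first partner whose markers all occur in the text, else None."""
    for partner, markers in SIGNATURES.items():
        if all(marker in text for marker in markers):
            return partner
    return None


def resolve_line(partner: str, line: OrderLine) -> tuple[str, int]:
    """Translate a client code into a warehouse SKU and a quantity in units."""
    try:
        sku, units_per_carton = SKU_MATRIX[(partner, line.client_code)]
    except KeyError:
        # In the workflow described above, such a file would go to the verification queue.
        raise LookupError(f"no SKU mapping for {partner}/{line.client_code}") from None
    if line.unit == "carton":
        return sku, line.quantity * units_per_carton
    if line.unit == "unit":
        return sku, line.quantity
    raise ValueError(f"unknown unit: {line.unit!r}")
```

Keying the matrix on the pair (partner, client code) rather than on the code alone avoids collisions when two partners happen to use the same identifier for different products.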

Technical Challenges

Custom-encoded PDF decoding: one trading partner's system generates PDFs with custom font encoding (CID), where characters are replaced by numeric codes. Standard PDF libraries returned unreadable gibberish. I developed a technique that uses known embedded text patterns (file paths, order numbers) to automatically reconstruct the character mapping. The approach resembles a known-plaintext attack on a substitution cipher, and it decodes the document without any manual intervention.
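The idea of rebuilding a character mapping from known text can be shown with a small sketch. It assumes the PDF text has already been reduced to a flat list of numeric glyph codes and that each code stands for exactly one character; both are simplifications. The function names and the sample strings are hypothetical, and the real tool's matching logic may differ.

```python
from __future__ import annotations


def extend_mapping(window: list[int], crib: str, mapping: dict[int, str]) -> dict[int, str] | None:
    """Align a known string (crib) with a window of codes.

    Returns the extended code -> character mapping, or None if the alignment
    contradicts what is already known or is not one-to-one.
    """
    forward = dict(mapping)
    reverse = {char: code for code, char in forward.items()}
    for code, char in zip(window, crib):
        if forward.setdefault(code, char) != char:
            return None  # this code is already tied to another character
        if reverse.setdefault(char, code) != code:
            return None  # this character is already tied to another code
    return forward


def recover_mapping(codes: list[int], cribs: list[str]) -> dict[int, str]:
    """Slide each crib over the code stream and keep the first consistent alignment."""
    mapping: dict[int, str] = {}
    for crib in cribs:
        for start in range(len(codes) - len(crib) + 1):
            candidate = extend_mapping(codes[start:start + len(crib)], crib, mapping)
            if candidate is not None:
                mapping = candidate
                break
    return mapping


def decode(codes: list[int], mapping: dict[int, str], unknown: str = "?") -> str:
    """Decode the stream, marking codes that no crib has revealed yet."""
    return "".join(mapping.get(code, unknown) for code in codes)
```

Because the sketch accepts the first consistent alignment, short cribs with no repeated characters can match in the wrong place. Long strings with repeated letters, such as a file path, constrain the alignment much more tightly, which is presumably why such patterns are useful as anchors.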

Bilingual delivery advice parser — bilingual PDF delivery advices have a nested structure where a single product might span multiple batches, each with a different expiration date and quantities expressed in cartons instead of individual units. The parser utilizes a state machine to track context across rows and accurately calculates the unit quantities based on carton counts and packaging ratios.
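A state machine of this kind can be sketched in a few lines. The line layouts below (PRODUCT and BATCH rows in a single language) are invented for illustration and are much simpler than the real bilingual PDFs; the names BatchRow and parse_advice are likewise hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from datetime import date

# Hypothetical line layouts, not the supplier's actual format.
PRODUCT_RE = re.compile(r"^PRODUCT\s+(?P<code>\S+)\s+UNITS/CARTON\s+(?P<ratio>\d+)$")
BATCH_RE = re.compile(
    r"^BATCH\s+(?P<batch>\S+)\s+EXP\s+(?P<expiry>\d{4}-\d{2}-\d{2})\s+CARTONS\s+(?P<cartons>\d+)$"
)


@dataclass(frozen=True)
class BatchRow:
    product_code: str
    batch: str
    expiry: date
    units: int


def parse_advice(lines: list[str]) -> list[BatchRow]:
    """Emit one output row per batch, carrying the product context across rows."""
    rows: list[BatchRow] = []
    current: tuple[str, int] | None = None  # state: (product code, units per carton)

    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if match := PRODUCT_RE.match(line):
            # A product header opens a new block and replaces the previous context.
            current = (match["code"], int(match["ratio"]))
        elif match := BATCH_RE.match(line):
            if current is None:
                raise ValueError(f"line {number}: batch row before any product header")
            code, ratio = current
            rows.append(
                BatchRow(
                    product_code=code,
                    batch=match["batch"],
                    expiry=date.fromisoformat(match["expiry"]),
                    units=int(match["cartons"]) * ratio,  # cartons -> individual units
                )
            )
        # Any other line (page headers, totals) is ignored in this sketch.
    return rows
```

The only state kept between rows is the current product and its packaging ratio, which is enough to turn each batch row into a flat record with a unit quantity.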

Intelligent address matching: the address resolution system handles real-world data issues such as Polish diacritics lost during PDF extraction (Wyszków → Wyszkow), multiple warehouses in the same city that must be disambiguated by warehouse code, and cross-client collisions when different companies have warehouses in the same city. The system uses client-filtered indexes with normalized text matching.
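A simplified version of such a matcher is sketched below. The warehouse records, Raben IDs, and the order of the three matching levels are assumptions for illustration; the sketch uses the standard-library difflib for the fuzzy step, which may not be what the real tool uses.

```python
from __future__ import annotations

import re
import unicodedata
from dataclasses import dataclass
from difflib import get_close_matches


@dataclass(frozen=True)
class Warehouse:
    client: str
    raben_id: str
    code: str
    postal: str
    city: str


# Hypothetical master data; IDs and codes are invented.
WAREHOUSES = [
    Warehouse("PARTNER_A", "RB-001", "DC1", "07-200", "Wyszków"),
    Warehouse("PARTNER_A", "RB-002", "DC2", "07-200", "Wyszków"),
    Warehouse("PARTNER_B", "RB-003", "WYS", "07-200", "Wyszków"),
]


def normalize(text: str) -> str:
    """Lowercase, strip diacritics, and collapse punctuation to single spaces."""
    text = text.replace("ł", "l").replace("Ł", "L")  # ł has no Unicode decomposition
    decomposed = unicodedata.normalize("NFKD", text)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", plain.casefold()).strip()


def resolve(client: str, address: str) -> str | None:
    """Return the Raben warehouse ID, or None when the match is missing or ambiguous."""
    candidates = [w for w in WAREHOUSES if w.client == client]  # client-filtered index
    tokens = normalize(address).split()

    def unique(hits: list[Warehouse]) -> str | None:
        return hits[0].raben_id if len(hits) == 1 else None

    # Level 1: an explicit warehouse code in the address.
    found = unique([w for w in candidates if normalize(w.code) in tokens])
    if found:
        return found

    # Level 2: Polish postal code (NN-NNN).
    postal = re.search(r"\b\d{2}-\d{3}\b", address)
    if postal:
        found = unique([w for w in candidates if w.postal == postal.group()])
        if found:
            return found

    # Level 3: fuzzy match of single tokens against normalized city names.
    cities = sorted({normalize(w.city) for w in candidates})
    matched = {m for t in tokens for m in get_close_matches(t, cities, n=1, cutoff=0.85)}
    return unique([w for w in candidates if normalize(w.city) in matched])
```

Filtering by client before any text matching is what prevents cross-client collisions, and returning None for ambiguous results leaves room for the verification queue instead of guessing. Multi-word city names would need phrase-level matching, which this sketch does not attempt.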

Dual-process GUI — a clean, tabbed interface allows non-technical employees to easily switch between processing outbound orders and inbound deliveries. It includes built-in numbering management, progress tracking, and error reporting.

Results

Metric	Before	After
Batch processing time	45–90 min (manual)	Under 10 seconds
Error rate	Frequent (manual rewriting)	Near zero (automated validation)
Format identification	Required domain expertise	Fully automated
Data security	N/A	100% local processing, zero cloud exposure
Employee dependency	Trained operator required	Any team member can operate

What changed

Reclaimed hours weekly — what used to consume a significant portion of the logistics coordinator's day now takes seconds.
Far fewer human errors: automated SKU resolution and quantity conversion removed the most common sources of manual mistakes.
Removal of specialist dependency — previously, only one trained person could process orders; now anyone on the team can click the button.
Scalability — adding a new trading partner only requires a new parser module; the rest of the pipeline functions automatically.
Resilience to changes — when a supplier modified their PDF layout, the modular architecture allowed for a targeted fix without affecting the rest of the system.
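One common way to get this kind of modularity is a parser registry, sketched below. The decorator name register, the partner key PARTNER_C, and the "code;quantity" text format are hypothetical; the case study does not describe how the real modules are wired together.

```python
from __future__ import annotations

from typing import Callable

Row = dict[str, object]
Parser = Callable[[str], list[Row]]

# Registry shared by the whole pipeline: partner name -> parser function.
PARSERS: dict[str, Parser] = {}


def register(partner: str) -> Callable[[Parser], Parser]:
    """Decorator that plugs a parser into the registry under a partner name."""
    def decorator(func: Parser) -> Parser:
        if partner in PARSERS:
            raise ValueError(f"parser for {partner} is already registered")
        PARSERS[partner] = func
        return func
    return decorator


@register("PARTNER_C")
def parse_partner_c(text: str) -> list[Row]:
    """Hypothetical text format: one 'code;quantity' pair per line."""
    rows: list[Row] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        code, quantity = line.split(";")
        rows.append({"client_code": code.strip(), "quantity": int(quantity)})
    return rows


def parse(partner: str, text: str) -> list[Row]:
    """Dispatch to the registered parser; the rest of the pipeline stays unchanged."""
    try:
        parser = PARSERS[partner]
    except KeyError:
        raise LookupError(f"no parser registered for {partner}") from None
    return parser(text)
```

With this structure, supporting a new partner means adding one decorated function, and a layout change at one supplier touches only that supplier's function.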

Technology

Component	Details
Language	Python 3.11+
GUI	Tkinter (native, no browser)
PDF processing	pdfplumber
Excel handling	openpyxl, xlrd
Architecture	Modular pipeline with interchangeable parsers
Deployment	Local desktop application; no server or cloud components
Data security	On-device processing only; data never leaves the computer

This solution was built as a dedicated business process automation project. The application runs entirely on the client's local hardware, guaranteeing that sensitive commercial data — pricing, order volumes, customer relationships — never leaves their control.