About

POC for an automated system extracting invoice data from mail attachments using computer vision, and sending the extracted data to a Google Sheet for further analysis by business teams.

Objective

Invoices are key elements in most business activities. However, more often than not, invoices are pdf files that are not considered "structured" data-wise: it can be a challenge to leverage invoice data for analysis purposes as it would require extracting said data beforehand, costing time and effort when the extraction is done manually.
That's where an automated data extraction system comes in handy. When a company has a dedicated mail address for invoices (e.g. [email protected]), it can be beneficial to set up a system that automatically extracts invoice data from new mails. Extracted data can then be added to a Google Sheet for further analysis by business teams.

Tools

Requirements

The following table lists the requirements taken into account for the system design.
Since this is a POC, not all the requirements were implemented in the development.

Number	Requirement	Description
1	Process "unread" mails	The "unread" status is what indicates to the script that the mail is "new" and must be processed.
2	Mark "unread" mails as "read" after being processed	Ensures that the mail won't be processed again by the next script execution.
3	Extract mail subject	In the destination file/database, each record must contain the mail subject. The subject can be used to easily find a mail and perform a manual check if necessary.
4	Process mails with single or multiple attachments	When a mail contains multiple attachments, the script must be able to process all the attachments.
5	Process multiple attachments individually	When a mail contains multiple attachments, the script must be able to process every attachment individually, one after another.
6	Process an attachment with a single page or multiple pages	When an attachment contains multiple pages, the script must be able to process every page to extract the correct data.
7	Cancel processing if the attachment is not a pdf file	The script only accepts pdf file as input.
8	Cancel processing if the attachment is not an invoice	When passed to the OpenAI completion chat API, the model must check the input file to verify that it's an invoice. If the input file is not recognized as an invoice, the model must not return any data. Otherwise, the processing can continue.
9	Cancel processing if the invoice record already exists in the destination file/database	When the invoice identifier has been extracted, the script must check that the identifier doesn't already exist in the destination file/database. If it already exists, processing must be cancelled and no data will be added to the destination file/database.
10	Cancel processing if the data cannot be extracted successfully	If the OpenAI completion chat API doesn't return any data, then the processing must be cancelled, no data will be added to the destination file/database.
11	Cancel processing if the data extracted doesn't have the correct output format	If the OpenAI completion chat API returns data with an incorrect output structure, then the processing must be cancelled and no data will be added to the destination file/database.
12	Save error event in an error log when processing is canceled	When a processing is cancelled, the event must be saved in an error log describing the error type. Error types: 1) Wrong file format: not a pdf file 2) Wrong file type: not an invoice 3) Data couldn't be extracted: no data returned 4) Wrong output format: output must be as follows: {'identifier': 'INVO-005', 'amount': '425'} 5) Duplicate entry: invoice identifier already exists in the destination file/database. Identifier: {identifier}. Amount from database: {amount}. Amount from processed mail: {amount}"

Process

Functional overview

Below is the overview describing how the system works on a functional level.

Steps:

New mail received in the invoice mail address (e.g. [email protected])
Every 5 minutes, the script checks for unread mails
For each unread mail, the script retrieves data from the invoice attachment: extracting invoice identifier and total amount
Extracted data is added to a Google Sheet
The unread mail is marked as read

Technical overview

Below is the overview describing how the system works on a technical level.

The script runs every 5 minutes with the built-in trigger system in GAS.
The invoice processing is done in Python with PyMuPdf and the OpenAI chat completion API: the pdf file is encoded as a base64 string, and passed to the OpenAI API to extract the information.
This invoice processing step is turned into an API with Flask and hosted on AWS with Lambda and API Gateway, so that it can be accessed with the /invoice endpoint on GAS through the UrlFetchApp.fetch() method.

See script files in the repository:

Steps:

New mail received in the invoice mail address (e.g. [email protected])
Every 5 minutes, the script checks for unread mails
For each unread mail, the script retrieves the attachment (invoice pdf file)
Invoice pdf file is converted to a pdf base64 string
Pdf base64 string is passed as a payload to the /invoice API endpoint
In the /invoice API, the pdf base64 string is transformed into a png base64 string with PyMuPdf so that it can be passed to the OpenAI completion chat API
Png base64 string is passed to the OpenAI completion chat API
OpenAI completion chat API extracts the invoice identifier and total amount and returns it as a json output
Finally, the /invoice API returns the invoice identifier and total amount
Returned data is added to a Google Sheet
The unread mail is marked as read

Diagram

Demo

invoice_demo.mp4

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
README.md		README.md
diagram.png		diagram.png
gas.js		gas.js
invoice_demo.mp4		invoice_demo.mp4
invoice_processing_workflow.ipynb		invoice_processing_workflow.ipynb
lambda_function.py		lambda_function.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

About

Objective

Tools

Requirements

Process

Functional overview

Technical overview

Demo

About

Uh oh!

Releases

Packages

Languages

FlorianLD/invoice_data_extraction

Folders and files

Latest commit

History

Repository files navigation

About

Objective

Tools

Requirements

Process

Functional overview

Technical overview

Demo

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages