POC for an automated system extracting invoice data from mail attachments using computer vision, and sending the extracted data to a Google Sheet for further analysis by business teams.
Invoices are key elements in most business activities. However, more often than not, invoices are pdf files that are not considered "structured" data-wise: it can be a challenge to leverage invoice data for analysis purposes as it would require extracting said data beforehand, costing time and effort when the extraction is done manually.
That's where an automated data extraction system comes in handy.
When a company has a dedicated mail address for invoices (e.g. [email protected]), it can be beneficial to set up a system that automatically extracts invoice data from new mails. Extracted data can then be added to a Google Sheet for further analysis by business teams.
- Gmail
- Google Sheets
- GAS (Google App Script)
- AWS Lambda
- AWS API Gateway
- OpenAI chat completion API
- Python libraries: Flask, PyMuPdf
The following table lists the requirements taken into account for the system design.
Since this is a POC, not all the requirements were implemented in the development.
Number | Requirement | Description |
---|---|---|
1 | Process "unread" mails | The "unread" status is what indicates to the script that the mail is "new" and must be processed. |
2 | Mark "unread" mails as "read" after being processed | Ensures that the mail won't be processed again by the next script execution. |
3 | Extract mail subject | In the destination file/database, each record must contain the mail subject. The subject can be used to easily find a mail and perform a manual check if necessary. |
4 | Process mails with single or multiple attachments | When a mail contains multiple attachments, the script must be able to process all the attachments. |
5 | Process multiple attachments individually | When a mail contains multiple attachments, the script must be able to process every attachment individually, one after another. |
6 | Process an attachment with a single page or multiple pages | When an attachment contains multiple pages, the script must be able to process every page to extract the correct data. |
7 | Cancel processing if the attachment is not a pdf file | The script only accepts pdf file as input. |
8 | Cancel processing if the attachment is not an invoice | When passed to the OpenAI completion chat API, the model must check the input file to verify that it's an invoice. If the input file is not recognized as an invoice, the model must not return any data. Otherwise, the processing can continue. |
9 | Cancel processing if the invoice record already exists in the destination file/database | When the invoice identifier has been extracted, the script must check that the identifier doesn't already exist in the destination file/database. If it already exists, processing must be cancelled and no data will be added to the destination file/database. |
10 | Cancel processing if the data cannot be extracted successfully | If the OpenAI completion chat API doesn't return any data, then the processing must be cancelled, no data will be added to the destination file/database. |
11 | Cancel processing if the data extracted doesn't have the correct output format | If the OpenAI completion chat API returns data with an incorrect output structure, then the processing must be cancelled and no data will be added to the destination file/database. |
12 | Save error event in an error log when processing is canceled | When a processing is cancelled, the event must be saved in an error log describing the error type. Error types:
|
Below is the overview describing how the system works on a functional level.
Steps:
- New mail received in the invoice mail address (e.g. [email protected])
- Every 5 minutes, the script checks for unread mails
- For each unread mail, the script retrieves data from the invoice attachment: extracting invoice identifier and total amount
- Extracted data is added to a Google Sheet
- The unread mail is marked as read
Below is the overview describing how the system works on a technical level.
The script runs every 5 minutes with the built-in trigger system in GAS.
The invoice processing is done in Python with PyMuPdf and the OpenAI chat completion API: the pdf file is encoded as a base64 string, and passed to the OpenAI API to extract the information.
This invoice processing step is turned into an API with Flask and hosted on AWS with Lambda and API Gateway, so that it can be accessed with the /invoice
endpoint on GAS through the UrlFetchApp.fetch()
method.
See script files in the repository:
Steps:
- New mail received in the invoice mail address (e.g. [email protected])
- Every 5 minutes, the script checks for unread mails
- For each unread mail, the script retrieves the attachment (invoice pdf file)
- Invoice pdf file is converted to a pdf base64 string
- Pdf base64 string is passed as a payload to the
/invoice
API endpoint - In the
/invoice
API, the pdf base64 string is transformed into a png base64 string with PyMuPdf so that it can be passed to the OpenAI completion chat API - Png base64 string is passed to the OpenAI completion chat API
- OpenAI completion chat API extracts the invoice identifier and total amount and returns it as a json output
- Finally, the
/invoice
API returns the invoice identifier and total amount - Returned data is added to a Google Sheet
- The unread mail is marked as read
Diagram