11 June 2024

An AI-for-travel data tool helps this business process PDF documents faster (PoC)

Mariusz Richtscheid

Software Architect



The travel industry struggles to organize transport, hotel booking, tour, and excursion information collected from multiple providers, much of which arrives as a flood of unstructured PDF documents. AI solutions for PDF data extraction can streamline this process.

Business challenges

Our partner, a travel company, handled many unstructured PDF files containing their users’ crucial trip details. These documents, often in different languages and inconsistent formats, required detailed handling to extract the relevant information.

The key challenges include:

Varied document formats. Inconsistent formatting makes it difficult to extract information.
Language barriers. Documents in multiple languages add complexity to the extraction process.
Information overload. Essential details are scattered across multiple sources and files.
Manual processing. Manual extraction is extremely time-consuming and prone to errors.

The client expected a quick Proof of Concept (PoC) to see whether automated data extraction was possible. The PoC confirmed the potential for automating the extraction and organization of critical travel data, setting the stage for a more efficient and streamlined operation.

Project objectives

To address these challenges, the project aimed to extract the following information:

booking numbers,
names,
flight details,
accommodation information,
car rental details,
transfer specifications,
excursion details.

The goal was to organize this information into relational objects that can be migrated into a database, where analytical SQL queries can yield data-based insights for optimizing business processes.
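As a rough illustration of what "relational objects" could mean here, the sketch below loads a few extracted records into an in-memory SQLite database and runs one analytical query. The table names, column names, and sample rows are invented for the example and are not taken from the project.

```python
import sqlite3

# Hypothetical extracted records; names and values are illustrative only.
bookings = [("BK-001", "Alice Example"), ("BK-002", "Bob Example")]
accommodations = [
    ("BK-001", "Hotel A", 3),
    ("BK-001", "Hotel B", 2),
    ("BK-002", "Hotel A", 4),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (booking_number TEXT PRIMARY KEY, traveler_name TEXT)"
)
conn.execute(
    "CREATE TABLE accommodations ("
    "booking_number TEXT REFERENCES bookings(booking_number), "
    "hotel_name TEXT, nights INTEGER)"
)
conn.executemany("INSERT INTO bookings VALUES (?, ?)", bookings)
conn.executemany("INSERT INTO accommodations VALUES (?, ?, ?)", accommodations)

# Example analytical query: total nights booked per hotel.
nights_per_hotel = conn.execute(
    "SELECT hotel_name, SUM(nights) FROM accommodations "
    "GROUP BY hotel_name ORDER BY hotel_name"
).fetchall()
print(nights_per_hotel)
```

Keeping the booking number as the shared key is one possible design; it lets flights, transfers, and excursions be added later as further tables that join back to the same booking.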


Implemented solution

The project uses Python, the LangChain library, and OpenAI’s GPT models. Here’s a detailed look at the two approaches we tried.

First approach: custom prompts and MapReduce chains

Our initial approach relied on custom prompts and LangChain’s MapReduce chains. First, the Map prompt extracted essential details from the PDF files, including accommodations, transfers, and rentals. Then the Reduce chain converted the outputs of the Map prompt into JSON objects.
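The map-then-reduce pattern can be sketched in plain Python. This is not LangChain's API or the project's prompts: the prompt texts, the `map_reduce_extract` function, and the `fake_llm` stub that stands in for a real model call are all assumptions made to keep the example self-contained and runnable.

```python
import json
from typing import Callable, List

# Hypothetical prompts; the project's real prompts are not shown in the article.
MAP_PROMPT = (
    "Extract accommodation, transfer, and rental details from the text below.\n"
    "Text:\n{chunk}"
)
REDUCE_PROMPT = (
    "Combine the notes below into a single JSON object.\n"
    "Notes:\n{notes}"
)


def map_reduce_extract(chunks: List[str], call_llm: Callable[[str], str]) -> dict:
    """Run the map prompt on every chunk, then one reduce prompt on the results."""
    # Map step: each chunk is processed on its own, so no single call
    # has to fit the whole document set into the context window.
    notes = [call_llm(MAP_PROMPT.format(chunk=chunk)) for chunk in chunks]
    # Reduce step: the per-chunk notes are merged and converted to JSON.
    reduced = call_llm(REDUCE_PROMPT.format(notes="\n---\n".join(notes)))
    return json.loads(reduced)  # raises ValueError if the reply is not valid JSON


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    if prompt.startswith("Combine"):
        return json.dumps({"note_count": prompt.count("---") + 1})
    return "note for: " + prompt.splitlines()[-1]


result = map_reduce_extract(["Hotel A, 3 nights", "Transfer to airport"], fake_llm)
print(result)
```

In this shape, the reduce step only ever sees what the map notes preserved, and the final `json.loads` fails on any malformed reply, so both stages are natural places to add checks.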

Challenges:

LLM hallucinations. The model generated incorrect data.
Problems with confidential data. The model sometimes refused tasks involving personal data.
Incorrect JSON format. The output was not always valid JSON.
Missing information. Data was often lost between the Map and Reduce phases.

Unfortunately, the results didn’t satisfy us, and crafting prompts that produced consistent outputs proved difficult. However, this technique did work for handling numerous input files that together exceed the LLM’s context size.

Second approach: built-in extraction chains and API Functions

For our next attempt, we used LangChain’s built-in extraction chains together with OpenAI’s Functions API (function calling). To execute this chain, we had to provide an object describing the schema of the properties we wanted to extract. We prepared many well-described fields for accommodation and transfer documents, which produced accurate results. For the PoC, we manually selected only the necessary files, simulating a scenario where files were already tagged. Later, we could automate this selection or ask the client to provide only the relevant files.

Well-thought-out key names and descriptions significantly impact the quality of returned data.

Example schema:

Note that this structure is designed based on the analyzed PDF files for a specific use case. You must prepare a different, customized schema for another data set.
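The project's actual schema is not reproduced here. Purely as an illustration, a schema for an accommodation document might look like the sketch below, written as a JSON-Schema-style dictionary of the kind function calling works with. Every field name, description, and enumeration value is hypothetical, as is the small `missing_required` helper.

```python
# Hypothetical schema: field names and descriptions are illustrative only.
accommodation_schema = {
    "type": "object",
    "properties": {
        "booking_number": {
            "type": "string",
            "description": "Booking reference exactly as printed on the document",
        },
        "hotel_name": {
            "type": "string",
            "description": "Full name of the hotel, without the address",
        },
        "check_in_date": {
            "type": "string",
            "description": "Check-in date in ISO 8601 format (YYYY-MM-DD)",
        },
        "check_out_date": {
            "type": "string",
            "description": "Check-out date in ISO 8601 format (YYYY-MM-DD)",
        },
        "board_type": {
            "type": "string",
            "enum": ["room_only", "breakfast", "half_board", "full_board"],
            "description": "Meal plan included in the stay",
        },
    },
    "required": ["booking_number", "hotel_name"],
}


def missing_required(record: dict, schema: dict) -> list:
    """Return the required field names that are absent or empty in a record."""
    return [key for key in schema["required"] if not record.get(key)]


print(missing_required({"hotel_name": "Hotel A"}, accommodation_schema))
```

Descriptive key names such as `check_in_date`, an explicit date format in the description, and a closed `enum` for the meal plan are examples of the kind of detail that can narrow what the model returns.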

Challenges:

Occasional hallucinations. Hallucinations were reduced but still occurred. For example, the LLM registered a car transfer that was not present in the data.
JSON format issues. Some JSON objects required manual correction.
Long processing times. After we switched the model from GPT-3.5 to GPT-4, processing a few files took several minutes. In return, the accuracy of the results increased significantly.
Data duplication. Instances of duplicated data occurred.
Library exceptions. Occasional exceptions halted the process.

Further improvements

You cannot rely 100% on LLMs. While they provide value, you must always validate the results via reliable sources/methods, especially for critical information.
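One simple validation, shown here as a sketch rather than the project's method, is to check that every extracted string value actually occurs in the source text and to flag anything that does not as a possible hallucination. The function names and sample data are hypothetical.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_fields(record: dict, source_text: str) -> list:
    """Return the keys whose string values do not occur in the source text."""
    haystack = _normalize(source_text)
    return [
        key
        for key, value in record.items()
        if isinstance(value, str) and value and _normalize(value) not in haystack
    ]


# Hypothetical document text and extraction result.
source = "Booking BK-001\nGuest: Alice Example\nHotel A,  3 nights"
record = {
    "booking_number": "BK-001",
    "hotel_name": "Hotel A",
    "car_transfer": "Airport shuttle",  # not present in the source text
}
print(ungrounded_fields(record, source))
```

A check like this flags values the model reformatted (for example, a date converted to another format) as well as invented ones, so flagged fields are better treated as candidates for review than as confirmed errors.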

To upgrade the solution, we planned the following improvements:

Input data tagging and cleaning. Only necessary data is processed.
Output data filtering and validation. Checking for duplicates and hallucinated data.
Better field descriptions. Using detailed descriptions and enumerations.
Retry mechanisms. Implementing retries for internal exceptions.
Improved PDF parsers. Utilizing advanced parsers for better data extraction.
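Two of the planned improvements, retries and duplicate filtering, could look roughly like the sketch below. The helper names, the choice of key fields, and the `flaky_extraction` stub that simulates a failing call are assumptions for illustration only.

```python
import time
from typing import Callable, Iterable, List, Tuple


def with_retries(task: Callable[[], dict], attempts: int = 3, base_delay: float = 1.0) -> dict:
    """Call task(); on an exception, wait and try again, up to `attempts` calls."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:  # a real version would catch narrower exception types
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


def deduplicate(records: Iterable[dict], key_fields: Tuple[str, ...]) -> List[dict]:
    """Keep the first record for each combination of key field values."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


calls = {"count": 0}


def flaky_extraction() -> dict:
    """Hypothetical extraction call that fails once before succeeding."""
    calls["count"] += 1
    if calls["count"] < 2:
        raise RuntimeError("temporary failure")
    return {"booking_number": "BK-001", "hotel_name": "Hotel A"}


first = with_retries(flaky_extraction, base_delay=0.0)
rows = deduplicate([first, dict(first)], key_fields=("booking_number", "hotel_name"))
print(calls["count"], len(rows))
```

Which fields identify a duplicate depends on the data; booking number plus hotel name is only a guess at a reasonable key for accommodation records.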

It took one week to validate ideas and plan for the future

Though currently a Proof of Concept (PoC), our project has already shown promising results in just a week of extensive research and experimentation.

We developed a partially working solution that already improves data extraction, setting the stage for a future without manual data work.

While our system was designed to extract accommodation information, we’re now upgrading it to process car rental data. Our next step is to create a compelling demo to present its potential and practicality.

Next, our partner will validate the solution. With LLM-based data extraction, they will be able to process travel information more efficiently and discover business insights through data-based analysis.



Software Architect experienced in backend technologies such as Node.js, cloud infrastructure, software architecture, and DevOps. Keenly interested in data engineering. One of the creators of Kakunin – our open-source framework for E2E testing. In his free time, Mariusz loves biking, machine learning, and 3D graphics.