When a database or data set that is critical to the survival of your business draws on multiple data sources, you need to be able to process that data efficiently and accurately. The most difficult data sources to process are printed paper and PDFs. A lengthy process of scanning, running OCR software, or manually entering data is neither efficient nor accurate. So what is the right thing to do? Who should you turn to? Let us share a real situation that happened with one of our clients.
We were given a PDF file that contained 219,121 pages and 2,848,401 records that needed to be loaded into a database. The data in the PDF was very valuable to the company and needed to be processed quickly. They didn’t have days, weeks or months to get the data into their system. Their competitors were almost certainly dealing with the same file, and if those competitors figured it out sooner, our client could lose a lot of business.
The file was not exported from Excel or from a text editor, but rather looked like some sort of export from a custom-built database. There were no out-of-the-box solutions that could parse the data efficiently or accurately.
The file looked like this:
The PDF file was arranged very nicely, but a simple copy and paste into Notepad did not produce text that could be loaded into a database. The reason is that PDFs organize text differently: the page is described as pieces of text placed at specific positions, not as rows and columns. A simple copy and paste does not carry over the “metadata”, such as those positions, that is used to visually present the text in an organized manner.
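To illustrate why position metadata matters, here is a minimal sketch in Python of how positioned text fragments might be regrouped into rows. It is not Clevertech's tool. The function name `group_into_rows`, the `(x, y, text)` fragment format, the `y_tolerance` parameter, and the sample values are all assumptions made for illustration, and the sketch assumes the fragments have already been extracted and that `y` increases down the page.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows, top to bottom, left to right.

    Assumption: y grows downward, and fragments on the same visual line
    have y values within y_tolerance of the first fragment in that line.
    """
    rows = []  # each entry: (y of the row's first fragment, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))   # same visual line
        else:
            rows.append((y, [(x, text)]))   # start a new line
    # Order the cells of each row from left to right and keep only the text.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical fragments, in the jumbled order a naive extraction might yield.
fragments = [
    (200.0, 100.5, "Smith"), (50.0, 100.0, "1001"), (350.0, 100.2, "42.50"),
    (50.0, 114.0, "1002"), (350.0, 114.1, "17.25"), (200.0, 113.8, "Jones"),
]
rows = group_into_rows(fragments)
for row in rows:
    print(row)
```

Without the coordinates, the fragments above would arrive in an arbitrary order; with them, each record can be reassembled as one row.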
Manually copying and pasting the PDF line by line could take days or weeks. At Clevertech, we have been dealing with complex data integration for nearly 12 years, and we knew how to come up with a process to solve this problem. We implemented a solution that automates this process so that it takes minutes instead of days or weeks. Our sophisticated tool is capable of “arranging” the text information from the PDF into an easy-to-manipulate form. The extracted data can easily be imported into an Excel spreadsheet, or into an MSSQL, MySQL or Access database for further processing. The next time the client receives a complex PDF, they will have no fear of losing valuable business.
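As a rough sketch of that last step, the Python snippet below writes already-extracted rows to CSV text (which Excel can open) and loads them into a database, then reads back the row count. SQLite is used here only as a stand-in for MSSQL, MySQL or Access, and the table name `records`, the column names, and the sample rows are hypothetical.

```python
import csv
import io
import sqlite3


def rows_to_csv(rows, header):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def load_rows(conn, rows):
    """Insert rows into a hypothetical 'records' table and return its row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (record_id TEXT, name TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]


# Hypothetical rows, as they might look after extraction from the PDF.
rows = [["1001", "Smith", "42.50"], ["1002", "Jones", "17.25"]]

csv_text = rows_to_csv(rows, ["record_id", "name", "amount"])
conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
row_count = load_rows(conn, rows)
print(csv_text)
print("Row count:", row_count)
```

Comparing the row count in the database with the number of records expected from the source file is a simple check that nothing was lost along the way.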
Below is a small screenshot from a large PDF file that contains 219,121 pages and 2,848,401 records:
Here is the result for the entire file. Notice the Row Count: 2,848,401 records.
At Clevertech, we love to take on complex projects. We define complex projects in the same way that Geoff Smart and Randy Street do in their New York Times bestseller, “Who: The A Method for Hiring”. They describe complex projects as a set of outcomes that only 10% of possible firms can achieve. Clevertech takes pride in being a firm that thrives on taking on complex projects and achieving the most efficient and accurate outcomes.