Data professionals estimate that 60% to 80% of analysis time is consumed by data preparation and cleaning — not the analysis itself. This proportion is frustratingly high, but it reflects reality: data from the real world arrives inconsistently formatted, incompletely filled, and structured for entry rather than analysis. Whether the source is a PDF conversion, a database export, a web scrape, or a manually maintained spreadsheet, cleaning is almost always required.
These techniques address the most common data quality problems encountered in business data work, with particular attention to issues that arise when working with PDF-converted data.
Whitespace is the most common hidden data quality problem. Leading spaces, trailing spaces, and double spaces within strings all look the same visually but cause VLOOKUP mismatches, COUNTIF undercounts, and join failures. The TRIM function removes all leading and trailing spaces and replaces internal multiple spaces with single spaces.
Apply TRIM to any text column: =TRIM(A2)
After conversion from PDF, text cells frequently contain extra spaces because the PDF extraction picks up spacing characters that separate text elements on the page. Run TRIM on all text columns as the first cleaning step. Once cleaned, copy-paste as values to replace the formula with the clean text.
For non-breaking spaces (ASCII character 160, common in some PDFs), TRIM does not remove them. Use =TRIM(SUBSTITUTE(A2, CHAR(160), " ")) to replace non-breaking spaces with regular spaces first.
Inconsistent capitalization — "ACME Corp", "Acme corp", "acme CORP" — creates duplicates in grouped analyses that appear as separate entities. Standardize with one of three functions:
=UPPER(A2) — all capitals, good for codes and identifiers=LOWER(A2) — all lowercase, useful for email addresses and normalized comparison=PROPER(A2) — title case (first letter of each word capitalized), appropriate for names and company namesChoose the format that matches your standard and apply consistently. After applying the function, paste as values to convert formulas to clean text, then delete the original column.
Numbers stored as text — indicated by the green triangle warning in the cell corner — cannot be summed, averaged, or compared numerically. This happens frequently with PDF-converted data where currency symbols, commas, or percentage signs are embedded in the extracted values.
Several approaches depending on the cause:
=VALUE(SUBSTITUTE(SUBSTITUTE(A2,"€",""),",","")) removes the euro symbol and commas and converts in one stepAfter conversion, verify by applying a SUM function to the column. Text numbers return 0 in SUM; if your SUM matches expectations, the conversion was successful.
Dates from PDF conversions can arrive in almost any format: 15/03/2026, 03-15-2026, 15-Mar-26, March 15 2026, 20260315. Excel may recognize some of these as dates and store them as date serial numbers; others it will leave as text strings that cannot be sorted chronologically or used in date calculations.
Test whether Excel has recognized a date by formatting the cell as a number. A number like 46095 is a valid Excel date serial number; text like "15/03/2026" has not been recognized as a date.
For unrecognized date text strings, use DATEVALUE to convert them. DATEVALUE works on strings that look like dates in any standard format. For non-standard formats (like YYYYMMDD), parse with string functions:
=DATE(LEFT(A2,4), MID(A2,5,2), RIGHT(A2,2)) converts YYYYMMDD string to an Excel date.
Once dates are recognized as Excel dates, format consistently (Format Cells → Date → choose your preferred display format) and they will sort correctly in any analysis.
Duplicate rows corrupt aggregations and lead to double-counting. Excel's built-in Remove Duplicates (Data tab) handles straightforward cases, but business data often requires more nuanced duplicate detection:
Before removing duplicates, always check a sample of what will be removed. True duplicates should be identical; if the records differ in substantive ways, they may not be genuine duplicates and should be investigated before deletion.
PDF tables with grouped rows often have blank cells in the group label column — the label appears once at the top of a group and the remaining rows in the group have blank cells. After conversion, these blank cells prevent filtering and grouping from working correctly.
Fill down technique: select the column, use Go To Special (Ctrl+G → Special → Blanks) to select all blank cells. In the formula bar, type =A2 (the cell above the first blank) and press Ctrl+Enter. This fills all blank cells with the value from the row above. Paste as values to convert formulas to static text.
This is one of the most common cleanup tasks for PDF-converted data and takes less than a minute with this technique even on large datasets.
PDF extraction sometimes merges data that should be in separate columns — first and last name in one cell, address components in one column, product code and description combined. Split these using:
After cleaning, apply conditional formatting to validate the result visually before analysis. Create rules that highlight cells that still fail quality checks:
=ISBLANK(A2) — highlight red=A2<0 — highlight orange=A2>TODAY() — highlight yellow=NOT(ISNUMBER(A2)) — highlight redAny highlighted cells need investigation before the data is used in analysis. This visual quality check catches issues that would otherwise silently corrupt results.
Data cleaning decisions are substantive analytical choices — particularly decisions about how to handle ambiguous cases, outliers, and missing data. Document these decisions in a separate "Data Notes" sheet in the workbook.
Record: the original source and date of extraction, each cleaning step applied, any cells that were manually corrected and why, any records excluded and the reason for exclusion, and the date the cleaning was performed and by whom. This documentation enables reproducibility, supports audit trails, and helps future analysts understand why the data looks the way it does.
PDF to Excel conversion that preserves table structure gives you better raw material for cleaning. Try our converter for free.
Convert PDF to Excel for Free