Hello Messorians,

Every health data analyst I know has a moment, usually in the first three months on the job, where they open a dataset, expect it to be clean, and instead find a swamp. They check with their manager, sure they've been given the wrong file. The manager says no, that's the real data, welcome to healthcare.

This is the part of the job nobody explains in advance. Healthcare data is not just complicated. It is structurally, systematically, and almost philosophically messy in a way that other domains are not. If you don't understand why, you spend your first year angry at the data. If you do understand why, you spend your career being the person who can actually work with it.

Here are the six reasons healthcare data is the way it is.

1. Duplicates, because identity is harder than it looks.

The same patient might exist in your dataset three times. Once under their maiden name. Once with a typo in their date of birth from a busy registration desk. Once as a duplicate MRN created when two hospitals merged and never reconciled their Master Patient Index.

This is not a bug. It is what happens when you let humans type names into a computer at the worst possible moment, often while the patient is in pain. Every health system has an MPI problem. Every analyst has to learn how to spot one.
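
Here is a minimal sketch of what "spotting one" can look like, using only Python's standard library. The patient rows, the field names, and the two cutoffs are made up for illustration. Real MPI matching typically uses more fields and a tuned, often probabilistic, method, so treat this as a starting point rather than a matching algorithm.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical registration rows. Field names and values are illustrative only.
patients = [
    {"mrn": "A001", "first": "Maria", "last": "Lopez", "dob": "1984-03-12"},
    {"mrn": "A017", "first": "Maria", "last": "Lopez-Grant", "dob": "1984-03-12"},
    {"mrn": "B230", "first": "Maria", "last": "Lopez", "dob": "1984-03-21"},
    {"mrn": "C442", "first": "James", "last": "Okafor", "dob": "1990-07-02"},
]


def similarity(a, b):
    """Return a 0..1 string similarity score, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def possible_duplicates(rows, name_cutoff=0.75, dob_cutoff=0.8):
    """Flag MRN pairs whose name AND date of birth both look close.

    The cutoffs are assumptions chosen for this toy data, not recommendations.
    """
    flagged = []
    for a, b in combinations(rows, 2):
        name_score = similarity(
            f"{a['first']} {a['last']}", f"{b['first']} {b['last']}"
        )
        dob_score = similarity(a["dob"], b["dob"])
        if name_score >= name_cutoff and dob_score >= dob_cutoff:
            flagged.append((a["mrn"], b["mrn"]))
    return flagged


for pair in possible_duplicates(patients):
    print("Review as possible duplicate:", pair)
```

The output is a review list, not a merge list. A human (or a stricter rule) still has to decide whether two flagged MRNs are really the same person.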

2. Missing values, because reality is harder than the form field.

A vital sign is missing because the visit ran 40 minutes long and the nurse charted at the end of shift. An allergy field says "unknown" because the patient could not remember. A smoking status is NULL because nobody asked. A discharge disposition is "Other" because the clerk did not have the information at the time the form had to be submitted.

The lesson: every missing value in a healthcare dataset has a story. Sometimes the story is benign. Sometimes the story is the entire reason your analysis is biased. Treating NULLs as "no data" instead of "no answer" is one of the fastest ways to ship a wrong report.
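
To make that concrete, here is a small sketch that keeps the different kinds of "missing" apart instead of dropping them. The records, the category labels, and the list of placeholder values are assumptions for illustration. Your own dataset will have its own placeholders.

```python
from collections import Counter

# Hypothetical smoking-status records. None stands in for a SQL NULL.
records = [
    {"patient": 1, "smoking_status": "never"},
    {"patient": 2, "smoking_status": None},       # nobody asked
    {"patient": 3, "smoking_status": "unknown"},  # asked, no usable answer
    {"patient": 4, "smoking_status": "current"},
    {"patient": 5, "smoking_status": None},
]

# Placeholder strings that look like data but carry no answer (assumed list).
PLACEHOLDERS = {"", "unknown", "other"}


def classify(value):
    """Separate 'never charted' from 'charted without an answer'."""
    if value is None:
        return "not recorded"
    if value.strip().lower() in PLACEHOLDERS:
        return "recorded, no answer"
    return "answered"


summary = Counter(classify(r["smoking_status"]) for r in records)
smokers = sum(1 for r in records if r["smoking_status"] == "current")

# The same numerator gives two different rates depending on the denominator.
rate_among_answered = smokers / summary["answered"]
rate_among_all = smokers / len(records)

print(dict(summary))
print("Rate among answered:", rate_among_answered)
print("Rate among all rows:", rate_among_all)
```

Neither rate is automatically the right one. The point is that the report should say which denominator it used and how many rows fell into each missing category.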

3. Inconsistent coding, because there are too many code systems and not enough discipline.

The same diagnosis can appear in your dataset as an ICD-10 code, a SNOMED CT concept, a free-text problem-list entry, and a phrase in a clinical note. The same lab test might have three different LOINC codes across three different lab vendors. A medication might be coded by NDC in one table, RxNorm in another, and brand name in a third.

This isn't because anyone is being lazy. It's because healthcare uses different code systems for different jobs (billing vs. clinical documentation vs. interoperability), and the mappings between them are not perfect. A query that says WHERE icd10 = 'E11.9' misses every patient whose diabetes is documented anywhere but the ICD column. Knowing where to look is half the job.
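
A rough sketch of the difference between the narrow query and a wider search, in Python. The rows and column names are hypothetical. The SNOMED CT concept shown is the one I understand to mean type 2 diabetes, but verify every code against your own terminology source before relying on it. The free-text pattern is deliberately naive: it would also match "no history of type 2 diabetes."

```python
import re

# Assumed lookups for type 2 diabetes. Verify against your terminology service.
ICD10_PREFIX = "E11"           # ICD-10-CM family for type 2 diabetes mellitus
SNOMED_CODES = {"44054006"}    # assumed SNOMED CT concept for type 2 diabetes
TEXT_PATTERN = re.compile(r"\b(type\s*(2|ii)\s+diabetes|t2dm)\b", re.IGNORECASE)

# Hypothetical problem-list rows. Column names are illustrative.
rows = [
    {"patient_id": 1, "icd10": "E11.9", "snomed": None, "problem_text": None},
    {"patient_id": 2, "icd10": None, "snomed": "44054006", "problem_text": None},
    {"patient_id": 3, "icd10": None, "snomed": None,
     "problem_text": "Type 2 diabetes, diet controlled"},
    {"patient_id": 4, "icd10": "I10", "snomed": None, "problem_text": "HTN"},
]


def has_t2dm(row):
    """Check the ICD column, the SNOMED column, and the free text in turn."""
    icd = row.get("icd10") or ""
    if icd.startswith(ICD10_PREFIX):
        return True
    if row.get("snomed") in SNOMED_CODES:
        return True
    text = row.get("problem_text") or ""
    return bool(TEXT_PATTERN.search(text))


narrow = [r["patient_id"] for r in rows if r["icd10"] == "E11.9"]
broad = [r["patient_id"] for r in rows if has_t2dm(r)]

print("ICD column only:", narrow)
print("All three sources:", broad)
```

The wider search finds more patients, and it also finds more false positives. Free-text hits in particular deserve a manual sample before they go into a count.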

4. Changing definitions, because the rules keep moving.

The 2017 ACC/AHA guideline lowered the threshold for hypertension from 140/90 to 130/80, and your historical dataset still reflects both. CMS updates the readmission denominator every few years, so the numerator in last year's report no longer lines up with it. "Length of stay" is calculated as discharge date minus admit date in one system and as discharge time minus admit time in another, and the two answers differ by 0.7 days on average.

Healthcare analysts spend more time defending their numerator and denominator than they spend writing SQL. The reason is not that the numbers are wrong. The reason is that "the right number" depends on which definition was in effect when.
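
Two of those definitions are easy to show side by side. This is a sketch with a made-up blood pressure reading and a made-up stay. The function names are mine, and the "at or above either cutoff" rule is a simplification of how the guidelines classify a reading.

```python
from datetime import datetime

OLD_THRESHOLD = (140, 90)  # systolic, diastolic cutoffs before 2017
NEW_THRESHOLD = (130, 80)  # cutoffs under the 2017 definition


def is_hypertensive(systolic, diastolic, threshold):
    """True if either number is at or above its cutoff (simplified rule)."""
    systolic_cut, diastolic_cut = threshold
    return systolic >= systolic_cut or diastolic >= diastolic_cut


def los_by_date(admit, discharge):
    """Length of stay as discharge date minus admit date, in whole days."""
    return (discharge.date() - admit.date()).days


def los_by_time(admit, discharge):
    """Length of stay as elapsed time, in fractional days."""
    return (discharge - admit).total_seconds() / 86400


# Same reading, two answers.
print(is_hypertensive(134, 82, OLD_THRESHOLD))
print(is_hypertensive(134, 82, NEW_THRESHOLD))

# Same stay, two answers: admitted late at night, discharged just after midnight.
admit = datetime(2024, 3, 1, 23, 30)
discharge = datetime(2024, 3, 3, 0, 30)
print(los_by_date(admit, discharge))
print(round(los_by_time(admit, discharge), 2))
```

Neither function is wrong. A report just has to say which one it used, and a trend line has to use the same one all the way across.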

5. Fragmented systems, because the patient sees more places than any one EHR knows about.

A single patient might generate data in primary care (Epic), a specialty clinic (NextGen), an inpatient stay (Oracle Health), a reference lab (separate LIS), an imaging center (separate PACS), and a retail pharmacy (separate system). Each one has its own ID for that patient. Each one stores their data in its own format. TEFCA is starting to fix this at the network level, but the inside of most health systems is still six tables that disagree on what the patient's last visit was.

If your analysis depends on having "the full picture" of a patient, you need to know which systems your dataset can see and which ones it cannot. Every blind spot is a place your conclusion could be wrong.
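
One way to make the blind spots visible is to list them for each patient. The sketch below assumes you have some crosswalk from each system's local ID to a shared patient key, which is a big assumption in practice. The system names, IDs, and dates are invented.

```python
from datetime import date

# Systems the analysis would ideally see (assumed list for this example).
EXPECTED_SYSTEMS = {"epic", "nextgen", "oracle_health", "lis", "pacs", "pharmacy"}

# Hypothetical crosswalk: (system, local ID) -> shared patient key.
crosswalk = {
    ("epic", "E-100"): "P1",
    ("nextgen", "NG-7"): "P1",
    ("lis", "L-55"): "P1",
}

# Hypothetical encounters pulled from each source system.
visits = [
    {"system": "epic", "local_id": "E-100", "visit_date": date(2024, 1, 10)},
    {"system": "nextgen", "local_id": "NG-7", "visit_date": date(2024, 2, 2)},
    {"system": "lis", "local_id": "L-55", "visit_date": date(2024, 2, 20)},
    {"system": "epic", "local_id": "E-100", "visit_date": date(2023, 11, 5)},
]


def coverage(patient_key):
    """Return the latest visit per system and the systems with no data."""
    latest = {}
    for v in visits:
        if crosswalk.get((v["system"], v["local_id"])) != patient_key:
            continue
        system = v["system"]
        if system not in latest or v["visit_date"] > latest[system]:
            latest[system] = v["visit_date"]
    blind_spots = sorted(EXPECTED_SYSTEMS - set(latest))
    return latest, blind_spots


latest, blind_spots = coverage("P1")
print("Last visit across visible systems:", max(latest.values()))
print("No data from:", blind_spots)
```

An empty result from a system does not mean the patient was never there. It may only mean your extract never included it, which is exactly the caveat to put next to the conclusion.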

6. Human workflow issues, because the data is a byproduct of how clinicians get through their day.

The nurse charts at the end of shift, so the timestamp on the vital sign is two hours after the actual measurement. The provider copy-pastes the previous note and forgets to update the pregnancy status, so the chart now says an 82-year-old is pregnant. The coder picks the diagnosis that maximizes reimbursement, not the one that best reflects the encounter. The alert fires so often that everyone dismisses it without reading.

Clinical data is not collected for analysts. It is collected as a side effect of busy people trying to take care of patients in a system that does not give them enough time. Every quirk in your dataset is a quirk in someone's workflow. The best analysts work backward from the data anomaly to the workflow that caused it, and that's the moment they stop being data people and start being informaticists.
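
A few of these workflow fingerprints can be flagged with simple plausibility checks. The sketch below assumes your data carries both a measurement time and a charting time, which many extracts do not. The field names, the one-hour lag limit, and the age range are illustrative assumptions, not clinical rules.

```python
from datetime import datetime, timedelta


def plausibility_flags(record, max_lag=timedelta(hours=1), pregnancy_ages=(10, 60)):
    """Return a list of reasons this record deserves a second look."""
    flags = []

    # Copy-forward errors often show up as values that cannot fit the patient.
    low, high = pregnancy_ages
    if record.get("pregnant") and not (low <= record["age"] <= high):
        flags.append("pregnancy status implausible for age")

    # End-of-shift charting shows up as a gap between measuring and charting.
    lag = record["charted_at"] - record["measured_at"]
    if lag > max_lag:
        flags.append("charted long after measurement")

    return flags


# Hypothetical record combining two of the examples above.
record = {
    "age": 82,
    "pregnant": True,
    "measured_at": datetime(2024, 5, 6, 14, 0),
    "charted_at": datetime(2024, 5, 6, 16, 0),
}

for flag in plausibility_flags(record):
    print("Check workflow:", flag)
```

A flag is a question to take back to the people who do the charting, not a correction to apply silently.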

The takeaway.

If healthcare data were clean, anyone with a Python tutorial could do this job. The reason healthcare data analytics is a real career, with real demand and real compensation, is precisely that the data is this hard. The mess is not the obstacle to your job. The mess is the reason your job exists.

The analysts who do this work well are not the ones who can write the fanciest queries. They are the ones who know why the queries have to be written the way they have to be written. That's the muscle to build. The SQL is teachable in a week. The instinct for why a number is wrong takes years.

You're building that instinct every time you sit with a confusing dataset and ask why instead of giving up.

Hit reply and tell me the wildest piece of dirty healthcare data you have ever had to clean up. I'm building a collection.

Sharif
Founder, Informessor
