Record Linkage and Data Integration for Maternal and Child Health Research

January 17, 2020 By Kailee Schamberger


Dr. Linares: Good afternoon.
My name is Dr. Deborah Linares. I serve as a health scientist
and project officer in the Division of Research within the Office
of Epidemiology and Research at the Maternal and Child
Health Bureau, Health Resources and
Services Administration. The Division of Research
provides ongoing support for maternal and child health,
or MCH, extramural research activity, including the Engaging Research
Innovations and Challenges, or EnRICH, webinar series. You are joining a community
of more than 100 participants with an interest
in advancing MCH research. The EnRICH webinar series
provides technical assistance and in-depth methodologic updates,
stimulating interest in applied and translational
MCH research. Today’s webinar is entitled “Record Linkage
and Data Integration for Maternal and Child
Health Research.” Before we start, I would
like to briefly introduce our speaker for this afternoon,
Dr. Russell Kirby. Dr. Kirby is a distinguished
university professor and Marrell Endowed Chair at the University
of South Florida. He is a perinatal
and MCH epidemiologist with training in human geography and preventative-medicine
epidemiology. In his 40-year career, Dr. Kirby
has worked on MCH issues in state health agencies and academic medicine focusing on
population-based research using most national and state-level MCH
secondary data sources. I will now turn the program
over to Dr. Kirby.

Dr. Kirby: Okay.
Well, welcome, everyone. I’ve been watching as the names
pop up of participants, and there’s people I know and quite a few people
I don’t know. So welcome to the webinar. What we are going to focus on is record linkage
and data integration. And the problem statement, kind
of setting the frame for this, is that MCH professionals work, really, at the interface
of several different domains — public health, clinical care,
programs, education, and other kinds of social
service programs, as well. And rarely does a single
database include data on all of the phenomena
we might be interested in incorporating
into our analysis. Record linkage is a technique
that we can use to link records on mothers
and children across databases, also over time, as well as
potentially multigenerationally. And data integration
provides a basis for the storage
of linkage results that we can then use
in future analyses. I always have to have
learning objectives. And I’ve got three here.
I think these are measurable. Hopefully by the end
of the webinar, you will all be able
to differentiate between deterministic
and probabilistic linkage methods and have some thoughts
on how to select the appropriate methodology
for whatever problem you have. You will be able to describe
frameworks for data integration of population-based
perinatal health data… and, hopefully, also identify
examples of research questions in our field that require
record linkage in order to obtain the necessary
data for analyses. We also have to have
a disclosure statement. And the long and short is, even though I’ve been working
in this field for a long time, I don’t have any current funding related specifically
to this presentation, and, of course,
we harmed no laboratory animals in the creation of this talk. And, then, more broadly,
conflict of interest, which I need to disclose. Again, there’s nothing really
that has any bearing on this particular presentation. But I do have relationships
with the March of Dimes. I’ve worked for several
pharmaceutical companies on scientific
advisory committees for postmarketing exposure
and so on. But none of that has anything
to do with this presentation. So, let’s start
with the beginning. What is record linkage? This is a quote
from Ivan Fellegi, and for those of you
who are interested in the history
of record linkage, look up Ivan Fellegi on Google, and you’ll find that he’s really
one of the pioneers in terms of the use of
record linkage and statistics. Canada has,
for over a half-century, been one of the leading places
where methodology for record linkage
has been developed. But basically the idea is that if we assume
that there are records that relate to individuals
and relate to entities, then record linkage
is the operation using identifying information
we find in a single record that allows us to seek
another record in the file or another record
in another file that refers to the same
entity or individual. And based on that,
record linkage has been around
for a very long time. Genealogy is a form
of record linkage. you can even argue
that the Bible has some aspects
of record linkage to it, if you read it closely, as well. But in public health,
the modern methods that we use really only date back
to the 1960s, and the broad use
of record linkage in public health,
and in MCH more specifically, is really more of a phenomenon
of the 1990s up to the present. This is just an example of
another form of record linkage. I got this out of
The Birmingham News when I worked in Alabama, and they’re looking
at the organizational chart which has a lot of linkages and wondering
if it’s an org chart or maybe it’s a family tree. And you’ll have to be
the judge of that. But thinking about record
linkage more specifically, I’m going to walk us through
the basic questions of who, what, why,
when, where, how. And we will look
at each one of those and get a little bit of insight
into them. As to which one
is most important, I’m not actually sure which one. I think several of them
are quite important. But I think probably,
thinking about it from the standpoint of “why”
is a really good idea. The very first thing
that you need to consider if you are wanting
to get information from more than one database is, what is the purpose
of your study and does record linkage
really make sense? And although many people
who know me think I’m a guru
of record linkage, I always take the first step
of thinking about, could I actually answer
the question that has been posed without doing record linkage? Could I just do calculations based on examining
numerators and denominators or even calculating ratios
without doing record linkage? And if that could be done,
the amount of effort that you go through
to do record linkage may be much greater
than the information need of the question
that you’ve been asked. So you really
have to think about that. But then if you do decide
that record linkage is needed, you really need to think about, how can you structure
the record linkage so that the results
of the record linkage potentially can be useful
to other people? It doesn’t make sense,
for example, for eight different
research groups in a state all to be doing
transgenerational linkage of birth certificates to the birth certificate
of the mother, for example. That should be done once. An agency within the state
health department should store that information and potentially
make it available, as appropriate, to others, rather than redoing the same
linkages over and over again. We also have to think about
whether the record linkage is technically feasible. And sometimes record linkages that might sound like
a good idea really turn out not to be. And then, again, as I mentioned, about whether record linkage
is really necessary. Turning to “how,” there’s a variety
of different questions we have to think about
in terms of this. The first one is manual
versus automated linkage. And sometimes
a computer-assisted linkage where you’re looking at data
on a screen might be sufficient. I had a project
when I worked in Alabama where I had a list of children who had been diagnosed
with autism spectrum disorder, and I wanted to find
their birth certificates. And, you know,
the number of records was only a few hundred, and they all were born
in the same year. And rather than doing
record linkage with a computer-assisted
approach — actually merging the records
with a program — I arranged to go down to
the state health department and sat in
the Vital Records office and pulled up
the birth-certificate file on my screen
and did a manual search, and I was able to get
the whole job done in a day. I probably would have spent at
least a week of programming time to do the job if I did it
in an automated fashion, and I might not have had
as good a success rate. So you have to
really think about that in terms of making a choice. Another issue
you have to think about is what kind of methodology
really makes the most sense for your particular problem. And, in general,
record-linkage methods fall into the categories of
deterministic and probabilistic. I’m going to get into
some of the details, the differences of them, a
little bit later in the webinar. But the basic difference is that
with deterministic methods, you are trying to come up
with exact matches, whereas with
probabilistic methods, you are using weights and establishing
probability of linkages rather than exact matches. And both of them are
scientifically valid approaches. The probabilistic methods probably have a stronger
theoretical basis because they can relate back to a whole body
of statistical theory based on probability. But they are both
valid methods to use. The major reason why we might
use deterministic methods is that we might have detailed personal identifier
information available, and if we have that,
we may consider more heavily using
deterministic methods. But you have to have identifiers
in order to do the linkage. And without them, you’re really
going to have a challenge. Basically, that means that
you need to have variables that can be found
in both of the databases that you want to work with and can be arrayed
in a similar fashion so that you can have a similar
way that the records are stored, similar variable names,
and so on. You also,
in terms of record linkage, have some special challenges
that come up in terms of names and dates, and, again, we’ll talk
a little bit more about both of those later. Names are great, but names are not always
typed exactly the same. They’re not always stored
exactly the same. There are variations in spelling
that may be cultural and may have other aspects
to them. And dates are
potentially issues, as well. Sometimes you have
incomplete dates. Sometimes you have transposed
month and day and all sorts of things
like that. Finally, in terms of
the question of software, should you develop
your own algorithms? Should you buy
specialized software? Should you use
a statistical software package? That is a question
that comes up a lot. In this day and age,
there are a lot of options in terms of software
that’s been developed by government agencies that are pretty much
freely available for use by public health
specialists, but you can also
develop your own. Here at the Florida Birth
Defects Registry where I work, we actually use algorithms
that we developed and do them within SAS. But we’ve been evaluating and
improving them for 15 years. And, then, finally, if you’re
going to do record linkage, it is imperative that you
evaluate the linkage results. Irrespective of
how you did the method, you want to be able
to say something about how generalizable
your findings are, who are the people
who aren’t linked, and so on. Who should do
the record linkage? This is one of those things
that comes up, as well, in terms of the personnel. Should you have dedicated
linkage specialists? If you have programs set up,
can any statistician do it? And, then, of course,
the question of whether the linkage staff
should be subjected to personality profiles. Some of them develop
a psychosis that I call “the urge to merge disorder.” And maybe some of you
on the phone actually have that. I don’t know. But there’s also the question
when we talk about “who” of defining
what records are eligible to be included in the linkage. And, again, you have to think
about that very carefully because sometimes
the reason why a record may or may not be included in one of the databases
that you’re using could have an implication for whether they’re likely
to be found also in your target data set. So you have to think about that,
as well. Then there is the “what.” What databases should we link? And, again, that has to be
thought about very carefully. What are
functional relationships between the records in each
of the candidate data sets? And if you actually
achieve the linkage, does it result in data
that allows you to answer the question that
you’re doing the linkage for? And then, again, also,
how does the linkage support the needs for which
the linkage was proposed? And it is very easy
for the linkage to become the tail
wagging the dog. It can sometimes
become so complex and use up so many resources
that it becomes the primary goal when really it is
just a small piece of your larger research plan. So you want to make sure that
you treat it in perspective. And, then, in terms of “what,” having a plan for how
you’re going to store the data at the end is definitely
very important, as well. Then we have “where.” Where should the linkage
be done? I have seen it done
in a lot of different places. I have seen it done in the
health statistics agency. I have seen it done
in epidemiology agencies. I have seen it at
university research centers, sometimes contracted to those. I have also seen it contracted
to outside vendors. All of those
might be appropriate. I worry the most
about contracting it outside in that you potentially
lose control over the process and potentially don’t get back
all the information about the linkage and
the linkage files at the end. And, then, there is the question of where and how should
the linkage results be stored. And, again,
do the researchers keep or do public agencies retain? I think the public agencies
should retain. I think we should be building
repositories of linkage results so that, over time, we are able to build up the ability
to link across a wide array of different
data sets and domains. And, then, of course,
that requires a data structure that can support the storage
of that information, either a structure that maintains the common link fields that each record
might have been linked to or building a more complex
relational structure. And, then, of course,
the question of building full-linked files
or stored linkage identifiers. At the minimum, you have to
store the linkage identifiers. If you don’t do that, you really
don’t have the linkage anymore. And, then, of course,
I always like to point out, since I’m a geographer,
that geocoding is a form of record linkage. It’s a form where we take
locational information that might be present
on our records and link to either
an address file or other kind of
administrative geography file, and that enables us to then link our records
with other information. And, then, of course, when, how
often should linkages be done? I don’t have
a great answer to this. Here in Florida, we do it when
we have the resources to do it because it is
a fairly lengthy process. But some linkages probably
need to be done immediately. I would say
with infant deaths — I think infant death records should be linked
to birth certificates immediately upon being filed. There’s all sorts of
programmatic imperatives that require that, but
very few states actually do it. On the other hand,
the periodicity also can vary depending on when the various
data sets are actually created. Hospital discharge records might be provided
on a quarterly basis, but in some states they might be
provided annually. Likewise,
there could be registry needs. If you’re working
with a registry that works based
on passive case finding, they may have a need to annually
re-create their data set. But they might also need to do something more frequently
than that. Other elaborate linkages might be done
a little bit less frequently. So, next, I have a few diagrams. I’m going to give you
a few examples of diagrams for data integration. And these are taken from
a variety of different programs that I have seen over the years, and we will talk a little bit
about each. This first one is one
that I put together, which is actually a model
that we use in Florida. Basically, this has to do
with linking data across pregnancies
and across generations. What we do is we link
the birth certificate records firstly to
hospital discharge records but also to mothers’
hospital discharge records, and we do that linkage
longitudinally so that we have hospital data
for the child going from the first year,
which is 1998, potentially up to the present. We do the same for
mothers’ hospital records, where we’re also able to link before the birth,
as well as after the birth. And, also, although we haven’t operationalized it ourselves, it’s a good idea to link the
birth certificate of the child to the birth certificate
of the mother so you can look at
transgenerational effects. If you’re really interested
in a life-course approach or setting social determinants, there’s a lot of interesting
things you can do if you make
that kind of linkage. But the final thing that we do
is we also link the birth certificates
across mothers so that we can identify
the sibship patterns… for individual mothers. Ultimately,
if I was able to access the individual record
of education data, I would be able to build
educational outcome profiles that could be studied
within families, for example, but I don’t have the ability
to do that yet. But that’s the direction
that one might go. This is an example
of the PELL data system. This is from the state
of Massachusetts. And PELL is Pregnancy and
Early Life Longitudinal study. It is very similar
in the middle in the core, with linking vital records data
to hospital records. But then it moves out and links with a variety
of other program data — newborn hearing screening,
Birth Defects Registry, program participation data,
death certificates, hospital services utilization, and other sources. This is an example
of how one might use the kernel
that I just showed you and then link outward
to other databases, as well. This one is from
the state of North Carolina that shows, for their birth
defects monitoring program, where they have
a central registry, which is conveniently
in the middle, and they link to vital records,
birth and death certificates. They also link
to Medicaid records, both for mother and baby,
but they link to a variety of other service programs — early intervention, WIC,
child service coordination, Health Department records,
clinical records, and so on. That is another example
of linkage where they start with the
birth defects record registry and then kind of
link out from there. And, then, this is a diagram. I don’t want you
to pay attention to any specific element of it, but this is a diagram
for the state of Florida that shows the variety
of record linkages that we do. Some of the ones that
I showed you a few minutes ago are embedded in this diagram. But the idea is to link
the vital-records information but link out to hospital,
link out to program data. We have a program called
Children’s Medical Services. Link to birth defects
and cancer, link to WIC, link to the Florida State
Healthy Start data, and then link to a variety
of other programs that we use a lot in MCH
like PRAMS, immunization registry,
and so on. Down here, I wanted
just to call out Florida, for a number of years,
participated in a program that CDC manages
called the SMART, the States Monitoring
Assisted Reproductive
Technology, and this was the program
where the state sent individual identifiable records
to the CDC, and they were linked
with records about assisted reproductive
technology, and that supported a lot of interesting
population-based research on outcomes of A.R.T. This is not a diagram
of databases, but it’s worth thinking about in terms of what direction
one might be going. If you’re thinking
about child health, there are really
three major domains that you need to be thinking
about in terms of child health, in addition to looking at things
like diagnoses and so on. Firstly, children are growing,
and you want to think about how we can measure their growth
and take stock of that in a number of
different dimensions. But the child
is also developing, and the child’s development
doesn’t occur necessarily on a linear path. Children who have disabilities or other kinds of impairments
or delays may not be progressing
as quickly as others. And then education. The child is learning like crazy
throughout early life and into early childhood. And these are
overlapping domains that need to be thought about. We don’t usually,
in public health, have databases
where we can measure all of these, particularly
on a population level, but we should always be thinking
broadly about child health whenever we’re considering it. And just to that end,
I have tried to put together what databases might look like
that you could try to capture some of that information for
child health in this diagram. The kernel of this diagram
is what I call the Kirby Master File,
or the KMF. I did not make up this name. It was made up
by an assistant administrator at the Wisconsin Division
of Health about 30 years ago when he was exasperated
as I kept talking about this. But the idea is that
you have as your kernel a database that includes
information on all the children. And ideally it goes back
to their birth certificate. If the child was not born
in your jurisdiction, you want to create
some kind of a dummy record that has enough
identifying information so that you can still link
to other data sources. And, then, of course,
these records should be linked to death certificates
and pediatric cancer cases. We should be linking
to hospital discharge data, if we have ER
and other outpatient data. If you happen to have
all-payer claims, go for it. That should be linked in here,
as well. But, then, the part that is a little bit more,
I guess, proactive is thinking about what kind
of educational data could we potentially link
with our records. And so we have
the child’s health status and school readiness
at school entry. And there is a lot of
different names that people have for what that might be,
but that is very important. Data on special education
placements and what kinds of reasons why children are
in special education. And then
educational outcome data are all important to be able
to link together with this. And, then, of course,
we need to be able to link with birth defect
surveillance data and other data that might
measure special needs. I’ve got that down here. Then, of course, developmental
disability surveillance if that happens to be happening
in your state. So that’s a model
that you can think about for how to integrate data for looking at child health,
growth, and development. This is a diagram that
we published in our textbook on perinatal epidemiology. This was published
about 10 years ago. And, again, there’s some
additional databases that I did not touch on
that are also included in here. I think I have
immunization registry, child abuse and neglect,
Child Protective Services, blood lead screening, potentially developmental
disability services, if your state has, and, of course,
newborn metabolic screening and hearing screening
and so on. So there is a whole array
of domains of data that we can potentially link. But the key thing is you need
a central holding place for it. And one of the cool things about
this, if you think about it — Let’s say hypothetically
that you have linked the immunization registry
with birth certificates and you have also linked
the early intervention data set with birth certificates. By virtue of that linkage,
you can also evaluate aspects, for example, of are children
who are in early intervention up to date
on their immunizations? Are there opportunities where
we could make improvements? By virtue of the fact that you have them linked
to a common data set, you also can create a data set where they are linked
to each other. And this is a diagram
that I just received a few weeks ago from
my colleague Kay Johnson, who many of you may know, and she was thinking about
developmental screening and all the kinds of programs
that might relate to developmental screening
and put this together. And, again, you can see
that there’s a number of different
public health programs that play a role in terms
of developmental screening. And I’m not sure exactly how
one would operationalize this from a linkage point of view, but it’s definitely
something to think about in terms of looking
toward the future. A few broader thoughts and
concerns about record linkage. There’s the question of
how might we incorporate our integrated health records
and integrated databases. And we have some challenges
with this. Right now a lot of states
are wrestling with the problem of Neonatal Abstinence Syndrome,
or N.A.S. Some people call it NAS. Some people look
a little more narrowly just at opioid
withdrawal syndrome. But this is something that a lot
of state health departments are wrestling with right now. And the question then becomes, how can we put together
a better approach to studying N.A.S.? One of the problems
is that if all we have is hospital discharge data,
we are only going to have information about a diagnosis which may or may not actually
be a valid diagnosis. We don’t have information about
how the diagnosis was made. But we don’t know very much
about treatment. We don’t know very much
about long-term outcomes. And we have to think about how can we pull, potentially,
clinical records in so that we can look at that
more systematically. And, then, there’s the question, what can we learn
from health claims? There are a few jurisdictions that have all-payer claims data,
or APCD. But, again,
how do we use these data? What are some of the methods
we need to use for linking them
with other sources and for analyzing them? I’ve done some work
with claims data where the major challenge is that we don’t have the
sociodemographic information that we’re used to having
in our public health databases. And if you link those records
to public health databases, then you can retrieve
that information and be able
to analyze it better. Then I mentioned
about data structures. The data integration
is really important and should be thought about
at the beginning of a project and not as an afterthought because the way
that you do your linkage or the things that you want to
actually use the linkage for to some extent are related
to how you store the data. I’m going to move on.
I just have this quote. “If you always do
what you always did, you will always get
what you always got.” And that is probably true in a lot of other aspects
besides record linkage. I wanted to spend
just a few minutes talking more specifically
about research methods and, in particular, there’s two different classes
of methods — deterministic and probabilistic. And we will take
a look at both of them, starting with
deterministic data linkage and look at some of
the key issues and concerns when we use this approach. So, firstly, if you want
to do deterministic linkage, you need to make sure
that you have variables that are common
to both data sets that are stored
in a similar manner and coded in a similar manner. So, again, my examples here
are primarily based on SAS, which I find
a very useful environment for teaching principles
of record linkage… …but do a PROC CONTENTS
and look. And then I just want to make —
This is just a huge caveat. I can’t express it often enough. If you are using
a relational database software where you think
that there are variables in different tables that represent data
from different sources that you can just do a join,
don’t. You need to do a lot more
looking at the data before you are going to get
any kind of useful result from that. There are just so many
differences in the way that information is stored, that you will wind up
with something that is not a very good result. And likewise,
if you are just going to use a single identifying variable or require a match on that variable together with others,
don’t. You need to gain strength from multiple variables
in your linkage. Social Security number is great. It’s carried on a lot of
health records, and it’s great, but if you rely only on
the Social Security number, there is going to be a sizable
proportion of your data set that you are not going to
be able to link at all, and you might even
throw them out because you could not link them,
and that is not good. Because typically what we find
is that the unlinked records in any record linkage are often very interesting
and important. Sometimes they’re
very high-risk cases, and throwing them
out of the database leaves you with
an incomplete understanding of the nature of the problem
you’re studying. But on the other hand, once
you have linked the records, creating a common identifier
that you store in both data sets would be really great because
then you would be able to put the data sets together readily
in the future. Okay. This was a little aside. Let’s say we have
two different data sets. We have a birth-certificate
data set. We have a newborn-screening
data set. These are examples
of four different variables that we might have
on both of the data sets. We might have information
about the mother’s name. We might have information about
the date of birth of the child. But if we want to link on these, we have to make sure that these
are all stored in the same way. When we’re looking at the names,
for example, maiden name is another variable
that might be there. It’s possible that
in one of these data sets the maiden name has been stored
rather than the legal last name. We have to make sure
that the data are the same, and we have to make sure dates
are formatted in the same way across both data sets, as well. Usually, I find
I have to break the dates down into month, day, and year and recast them
in a different metric in order to be able to assure
that I am linking correctly. And, then,
we also have, similarly, variables that are common
on the child. The last, middle,
and first name of the child, the gender, the date of birth. Maybe the newborn screening
doesn’t have the date of birth, but it might have
the screening date, and you can make some
assumptions about dates that can potentially
enable you to link there. Another important variable
for this linkage is the hospital of birth, which can be very useful
in blocking your analysis. In fact, here it is here.
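The preparation steps described above (storing names the same way, breaking dates into month, day, and year, and blocking on hospital of birth) might look roughly like this in Python. The talk itself works in SAS; the field names, the two date layouts, and the sample records here are assumptions made up for illustration, not the speaker's actual files.

```python
from collections import defaultdict
from datetime import datetime

DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")  # assumed layouts in the two files


def standardize_name(name):
    """Uppercase a name and keep letters only (drops spaces, hyphens, quotes)."""
    return "".join(ch for ch in name.upper() if ch.isalpha())


def split_date(text):
    """Return (year, month, day), or None if no assumed layout fits."""
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
        return parsed.year, parsed.month, parsed.day
    return None


def block_by(records, field):
    """Group records by a blocking field such as hospital of birth."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[field]].append(rec)
    return blocks


births = [
    {"id": "B1", "child_last": "O'Neil", "dob": "03/05/2019", "hospital": "H01"},
    {"id": "B2", "child_last": "Garcia-Lopez", "dob": "03/06/2019", "hospital": "H02"},
]
screens = [
    {"id": "S1", "child_last": "ONEIL", "dob": "2019-03-05", "hospital": "H01"},
    {"id": "S2", "child_last": "garcia lopez", "dob": "2019-03-06", "hospital": "H02"},
]

screens_by_hospital = block_by(screens, "hospital")
matches = []
for birth in births:
    birth_dob = split_date(birth["dob"])
    if birth_dob is None:
        continue  # set aside for later review rather than matching on a blank
    # Compare only against screening records from the same hospital block.
    for screen in screens_by_hospital.get(birth["hospital"], []):
        same_name = standardize_name(birth["child_last"]) == standardize_name(screen["child_last"])
        if same_name and birth_dob == split_date(screen["dob"]):
            matches.append((birth["id"], screen["id"]))

print(matches)
```

Blocking keeps the comparisons within one hospital at a time, so the exact-match check runs against far fewer candidate records than a comparison against the whole file.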
Hospital. And, then, of course,
the ZIP code of maternal residence
is another. These are all fields that
potentially could be useful in doing
a deterministic linkage. What we next have to do
is to look for missing data in the linkage variables. And you know what? If you have a record
in one of your data sets that’s missing on all of the attributes
you’re trying to match to and there are other records
in the other data set that are missing
on all the attributes, they’re going to match because
they match on being missing. So you have to think about, what
do you want to do with that? But, again, the decision about
how you want to handle that, what people sometimes do
is they do a sort on the set of variables they are
trying to do their linkage on and exclude any records
that are missing but don’t throw them away. Store them in another data set
where you can pull them back in for further processing later. But you do have to be concerned
about how you handle that. And, then, again,
you’re looking for records that share the same values
in each of the databases. And then when you find records
that do share the same values, the question is,
what do you do with those? And you also
have to think about, what variables
are the best variables? And, again, there are
some scientific approaches to thinking about this, but, really,
common sense can help a lot. If you have a variable
that has a lot of missingness, probably not a great choice. And, then, again, what do
you know about the variables? What kind of information
do they have? How specific is it? You have to think
about those kinds of things. We always recommend that you use the most discriminating
combination of variables first and loosen the criteria
as you go along. If you think about it, in terms of most strict
to least strict — If you think about it, gender is not a particularly
useful discriminating variable because in babies,
almost all of them are going to fall
into one of two categories. There’s a little bit of noise
in terms of that, but they’re going to fall
into one of two categories, and that’s not going to help you
very much in linking the records because it doesn’t block you
very specifically. But you want to look, again,
at the variables and make decisions
based on that. And then you want to start with your most strict criteria
for linkage. If you can get a perfect match on a vector of five
different variables, that is probably going to be
a pretty good match. On the other hand,
if the record falls through and only matches
on linkage step six, where you have a perfect match
on only two of the variables, you’re going to need to evaluate
that record more closely to see whether
it truly is a match. And, again, you start
with the most strict and you go to least strict. Again, you always
want to set things up so that you can merge back
with the original data set. And you have to create
some kind of an I.D. number that you retain
across all the records no matter where they end up
in your process because you can very easily end
up with a large number of files that you are working with and you want to make sure
that you don’t end up with multiple copies of the same
record in the data set. And, again, really important
to use BY variables. I’ve just made up
this example here. Let’s say we have
a birth-certificate file and some kind of
medical record file, and we just say
DATA LINKED, MERGE BCERT MED. You know what’s going to happen
if you do that without a BY statement? If you say the birth-certificate
file has N records and the medical record file
has M records, you’re going to generate
N-times-M records because every
birth-certificate record is going to match
with every medical record, and that’s not going to be
very useful. BY variables are essential. And you want to use them
in a way that allows you to discriminate across
the different variables that you want to match on. It’s also very important
to unduplicate. You actually need
to unduplicate your files before you’ve linked
and then after you’ve linked. And in SAS, the NODUPKEY option
of PROC SORT is a useful one. And there’s also a way that you can export
the NODUP records so that you can use them
for future analyses. Multiple births
are a lot of fun, and that’s frequently a group
we need to process separately from the rest of
the birth certificates, and, again,
we can subset them out here. And then you want to merge by the linkage variables
that you have chosen. You want to create a data set that only has
the linked records, keep track of what link level
the records merged on. I’ve seen
deterministic algorithms that have as many as 250
different steps in trying to come up with different combinations
of variables. But you don’t want to throw away
the records that fail the match
at each step. You want to pull them in
so that you can reanalyze them. And, also, if you remember
back to the probability class that you had in your
introductory statistics course, consider full replacement. It might be better if you
actually run the entire data set through each link level. Then you’ll be able
to get much more information about what links are possible. Problem is, it might actually be that you have
a transaction record — say it’s a newborn-screening
record — that wants to match to more
than one birth certificate. If you pull out records
at the first match, that record doesn’t get a chance to possibly match
with another record later on. But sometimes
there are errors in our data that can potentially
allow that to happen. So you want to be thinking about
all the possibilities. Okay. Then you want to put
everything back together. You merge it all back together. And what I typically do is create updated
unlinked records to go to the next level
of the linkage algorithm. And then put it all together. Always study
the unlinked records. They are very interesting. They may actually tell you more than what you learned
from other things. Be looking for bias,
looking for systematic errors, hopefully things that
you can potentially correct. And then you want
to evaluate the quality of the records
that you actually have linked. I don’t ever want to see
anybody on this call publishing a report
about record linkage where they don’t tell me something about
the linkage itself. I don’t want to see
a method section that says, “We linked hospital
discharge records for infants with their birth certificates
and analyzed 122,422 records.” That doesn’t tell me anything
about which records didn’t match and what you did to try to
learn more about your data set. Epidemiology is all about
understanding what our
reference population is, and if you don’t give me
that information, I don’t know. So make sure you do that. Okay. We are running
a little late on time. I wanted at least to spend a little bit of time
on probabilistic linkage, so I’ll do that really quickly. The idea here is that
we use probabilities to determine whether
a particular pair of records, one from each data set,
refer to the same individual, and we calculate weights
to quantify the likelihood that a particular pair
are a true match. This is computationally
intensive because we’re
basically comparing each record in the data set with every other record
in the data set. Again, our probabilistic weights can be either
nonspecific or specific to particular values
in the data set. Thinking about nonspecific,
we may be just looking for agreement
on a particular variable. For example, direct agreement
on date of birth gets a higher weight
than match on sex. Again, date of birth
is much more specific. There are 365 different possible days,
or 366 in a leap year, compared to sex, which
typically only has two values. But then a disagreement on sex should get a higher
penalty weight than a disagreement
on date of birth. That makes sense, I hope. Value-specific weights
relate to particular variables and the values
that they might hold. For example, the letter Z
might get a higher weight than the letter S, but disagreement on the letter S might also be given
a higher penalty than disagreement
on the letter Z. The weights allow us
to objectively reflect our confidence in a match. But there is
individual choice involved. And we also have to think,
when we’re all done, what is the overall score that
we want to set as our cutoff for throwing out matches that are not sufficiently
high probability? And that is
a subjective activity that you have to really know
your data to understand. Now, probabilistic
linkage methods. A lot of people
write their own programs. There’s a lot of packages
out there. Some of them are expensive
and difficult to use. Pretty much all of them are
actually a little complicated until you understand the basics. Some of them
are actually available as freeware or shareware. And I have this slide here. Don’t pay any attention
to the dollar amounts. I haven’t really researched that
in a long time. But Automatch is a program that has been around
since the 1990s. It’s very expensive. Top-end corporations use it. It’s probably built
into Oracle software. GRLS is another program that’s
been around for a long time. It’s built into Oracle’s
healthcare software. But there are a number of other
programs that are out there that a lot of people use. LinkPro is a program
that is commercially available. There’s also a version of it
called Links. This is the software
that was initially developed by the University of Manitoba for their integrated
provincewide database. But there’s a whole bunch
of other programs. And I think somebody
asked a question about what CDC
might have available. The Link King and Link Plus are two examples of software
out there that were developed
either by CDC or SAMHSA. There are others, as well. There’s also open-source
freeware. FEBRL is a program
that was developed at Australian National
University. FRIL is a program
developed at Emory but with input from people
at CDC’s Birth Defects Center. These are very flexible
programs. They’re not ideal. They take
quite a bit of time to learn. But they can generate
very high-quality results. So you have choices on that. Finally, linkage evaluation. I did want to spend
just a minute on this. It’s a lot easier to do
linkage evaluation with probabilistic methods
because you’ve already built in the process for it
into your algorithms. But you still have to decide
at what level of tolerance will you accept matches, and that, again, is something
that is subjective. There are ideas out there in the health information
management literature, but you really have to make
your own decisions on that. And, then, again,
document, document, document. I can’t tell you how many times
I have encountered colleagues in state health departments
who have been hired — I know a number of people who
have been hired specifically to do record linkage,
and they get there, and they have to
start over from scratch because nobody’s kept
the metadata that are necessary to enable them to do their job. So, document, keep track
of everything that you do, and create kind of
a resource book that outlines
all of your methodologies and things that you learned
from doing it this time that you’re going to improve
next time. Document all of that
because you might think that if you go through
this process once, next year, you can just run
the whole program script, just change a couple of dates. And it doesn’t work that way. You really have to be
paying attention at every step
through the process. We talked about linkage
and data integration. Are they required? The answer is maybe, maybe not. A few things
we need to think about. How precise is the need to know? Can the question be answered
through calculations? Do we actually have
individual-level records that have
appropriate identifiers and could get the necessary
permissions to do the linkage? How accurate do we want
our linkage to be? What’s our match rate that
we’re trying to achieve? Do we have
the resources necessary to conduct and evaluate
the linkage project? Do we have the resources to analyze the data
when we’re done? Can we store the results
so that they can be used later and potentially be used
to inform future analyses? All of those things
need to be thought about. I did want to give you
my contact information if you want to contact me. You can also give me a call. And I will be happy to follow up with anything that you might
want to ask me. I’m going to turn things
back to Deborah. I think we might have run
a little over. But I’ll turn things back. Dr. Linares:
Thank you so much, Dr. Kirby. What an informative and
interesting presentation. We really appreciate
you taking the time to share your expertise
with the MCH community. We are now ready for the
question-and-answer period. Our first question is,
“How feasible is it to link PRAMS data with
insurance claims data? Ideally one year
before pregnancy? To three years
after delivery?” Dr. Kirby: So, asking about
linking PRAMS records. The first thing to note about
that is that you probably don’t want to link PRAMS records. You want to link
the birth certificates associated with
PRAMS respondents because the PRAMS record itself isn’t going to have identifiers
that will enable you to do that kind of linkage. You really have to think about
the nature of your inquiry to decide what kind of window
you want to work with for linking back
to health care records. Firstly, it depends on what health care records
you’re working with. If you’re working
with Medicaid records, you want to parse through
the Medicaid data set and make sure that the mother was continuously eligible
for Medicaid during the time period that you
are trying to link records for. That’s going to subset things in possibly some ways
you don’t like, but you’ll get incomplete data
and potentially unusable data if you don’t do that. If you’re working, say,
with a health plan — Say, hypothetically,
you might be in California and you are linking with the Kaiser Permanente database
of Northern California, where you have a much more
stable health-care population, that would probably work better. But you really need to think
about the specific nature of the inquiry to decide. For example, for preconception
health-care issues, it depends on what it is,
as to whether you want to look just at three months
prior to conception or a full year
prior to conception or, even if it’s a woman
who had a previous birth, trying to go back to right
after the previous delivery. But it really is going to depend in terms of how
you want to do that. Dr. Linares: Great.
Our next question is, “If researchers are interested
more in preconception health, pregnancy-related outcomes,
and newborn health, are there any linked data sets
available that you would recommend
researchers to use?” Dr. Kirby:
That’s a really good question. I’m going to say
it kind of depends. And the unfortunate thing is that the United States is way behind many other
Western countries in terms of thinking
about data integration that would support
that kind of research. We do have some sources
that could be useful, but oftentimes
they are databases that are hard to link. So, for example, I don’t know
how many people on the call are familiar with the
Listening to Mothers surveys. And there’s a third wave. I think they might actually be
doing a fourth wave right now. But this is a survey of women
who have recently given birth, and it collects
a wide array of information about the birth experience and how they interacted
with health care providers and has some information
on exposures and what kinds
of health procedures they had during the pregnancy
and so on. But you really
can’t link it to anything because it’s a sample survey
that doesn’t have any personal identifiers. So thinking about it
gives us a lot of insights, but it doesn’t necessarily
get us linked back to programs. And likewise, we have
a wide array of programs that collect information
on early childhood and infant and toddler care. States have
home visiting programs that are collecting data
and so on, but it doesn’t necessarily
enable us to look holistically. There’s a lot of things
we can do with it, but, comprehensively,
looking from preconception care to early childhood, outside of being within
one of the staff-model HMOs, is still a challenge. I’m not going to say
that there aren’t any longitudinal research programs that would have
those kinds of data. There probably are some. But for the kind of things
that we do on population-based
public health, it’s a bit more difficult. But we can always
look to the future. It could be down the road. Dr. Linares: Great.
Our next question is whether there are any state data sets linking with the National Survey
of Children’s Health, especially for children
with special health care needs. You mentioned earlier some
data sets in your presentation for children with
special health care needs and special education data. Dr. Kirby: Exactly. And the National Survey
of Children’s Health was actually created in part
to fill the gap because of the fact that we had much more limited data
on special needs than many state
Title V programs needed in order to understand
their population and evaluate their programs. But the problem is the National
Survey of Children’s Health is, again,
a representative sample survey. It’s publicly available.
It doesn’t have identifiers. I’m going to say it’s
theoretically possible, working with the Maternal
and Child Health Bureau and the Bureau of the Census, you might be able
to get permission to do a record linkage with vital statistics,
for example. But I have not heard of anybody actually
trying to do that, and I would think that,
from a lot of perspectives, that would be
a pretty challenging thing to actually do. What you can do, however, is build a state-level database about characteristics,
health care characteristics, economics, and other kinds of
factors at the state level and use that information
in a multilevel analysis where you use the information
about the children nested within their state
to do that kind of analysis. But in terms of linking
the NSCH directly on a record-level basis, I don’t really think
we can do that. And I don’t know
if there’s somebody from MCHB who wants to chime in,
but I’m pretty sure that would be
very difficult to do. Dr. Linares: Thanks, Russell. We can inquire about that
internally and get back to the person
who asked that question. So, our next question is, “Do you have a recommendation
for any linkage data sets to study the outcomes
for children born to women with and without mental health
and substance use issues?” Dr. Kirby: Ohh.
That’s a really good question. It depends on what kind of
programs the women are in. There actually are databases. SAMHSA has databases. Many states have databases relating to
substance-abuse treatment that potentially
could be used for that. Here in Florida,
I have colleagues who have access
to Medicaid data, and we’ve actually been
looking at the association between mental health and substance use and risk for neonatal
abstinence syndrome. The problem is we only have — Again, it’s claims data. We don’t have a lot
of demographic information that we can work with. And because
of the nature of those data, the researchers who have them
are prohibited from linking them
with any other data sources. But I think there probably
are similar databases to that available in many states, and it’s certainly
worth exploring. Probably the best thing to do would be to talk with
the state-level agencies that administer mental health
and substance abuse programs in your state and learn about
what kind of data sources they have available
and kind of see from there. I do know that in Florida our
analyses would be much enriched if we had the ability
to link those records with birth certificates,
just for one example. Dr. Linares: Great.
Thank you so much again for a great presentation
and for sharing your expertise. If you did not get a chance to ask your question
for the speaker, please still feel free to submit your question
through the Q&A field. We’ll try to respond to your
questions after the webinar. We are now almost
at the end of our program. After this webinar,
you will receive a request to complete an evaluation. We hope that
you will fill this out and provide the MCH
Division of Research with feedback on today’s event. Your response will help us
plan future webinars in the EnRICH series. Thank you all for your
attendance and participation. I also want to thank Jen Rogers, Rebecca Harnik, and
Jim Wetherill at Altarum for helping
to organize this event. An archive of today’s webinar
will be available on the Division of Research
website in several weeks. Have a wonderful afternoon,
everyone.