Understanding Results

During a run, Name Match logs information about the matching process: it tracks execution, reports performance metrics, and flags any issues. In addition to being printed to the console, the log is written to a file at output/details/name_match.log.

The log contains several metrics indicating how successful the match was. Check it after the match finishes to confirm that match quality is acceptable. Below is a breakdown of the terms and metrics you will see: what they mean and what values are reasonable.

Blocking

Terms:

  • True pairs: pairs of first_name/last_name/dob values that we know refer to the same entity based on the UniqueID

  • Covered pairs: pairs of first_name/last_name/dob values that make it through the blocking stage

  • Uncovered pairs: true pairs that don’t make it through the blocking stage (the fewer the better)

Metrics:

  • Pair completeness: share of true pairs that are covered (the bigger the better, max 1)

    • Including equal blockstrings: 0.90 in our experience with arrest records

    • For non-equal blockstrings: > 0.75 in our experience with arrest records
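As a rough illustration of how pair completeness follows from the terms above, here is a minimal sketch. It is not Name Match's internal code: the function name, the representation of pairs as tuples of record ids, and the example data are all assumptions made for illustration.

```python
def pair_completeness(true_pairs, covered_pairs):
    """Return (share of true pairs that are covered, set of uncovered pairs).

    Hypothetical helper: pairs are given as 2-tuples of record ids and are
    treated as unordered, so ("a", "b") and ("b", "a") are the same pair.
    """
    true_set = {tuple(sorted(p)) for p in true_pairs}
    covered_set = {tuple(sorted(p)) for p in covered_pairs}
    if not true_set:
        raise ValueError("no true pairs to evaluate")
    uncovered = true_set - covered_set
    completeness = len(true_set & covered_set) / len(true_set)
    return completeness, uncovered


# Made-up example: four true pairs, three of which survive blocking.
example_true = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
example_covered = [("b", "a"), ("c", "d"), ("e", "f"), ("x", "y")]

completeness, uncovered = pair_completeness(example_true, example_covered)
print(completeness)  # 0.75
print(uncovered)     # {('g', 'h')}
```

Note that covered pairs that are not true pairs (such as ("x", "y") here) do not affect pair completeness; only the true pairs lost in blocking lower it.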

Modeling

Terms:

  • Threshold: the P(match) cutoff (P(match) is also called phat) used to classify a pair as a “match” vs. a “non-match”

Metrics:

  • Base rate: the fraction of record pairs with ground truth labels that are matches

  • Typical classifier performance metrics (e.g. precision, recall, f1, auc): these are reported out-of-sample, which is possible because some labeled data is held out of training whenever pct_train is less than 1.
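To make the relationship between the threshold and these metrics concrete, the sketch below computes the base rate, precision, recall, and f1 for a small set of held-out labeled pairs. The function name, the example values, and the choice to treat phat >= threshold as a match are assumptions for illustration; the log may compute these quantities differently.

```python
def threshold_metrics(phats, labels, threshold):
    """Classify pairs as matches when phat >= threshold (an assumption)
    and compare the predictions against ground truth labels (1 = match)."""
    preds = [p >= threshold for p in phats]
    tp = sum(1 for pred, y in zip(preds, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(preds, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(preds, labels) if not pred and y == 1)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "base_rate": sum(labels) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up held-out pairs: predicted P(match) and the true label.
heldout_phats = [0.95, 0.80, 0.60, 0.40, 0.20, 0.10]
heldout_labels = [1, 1, 0, 1, 0, 0]

metrics = threshold_metrics(heldout_phats, heldout_labels, threshold=0.5)
print(metrics)
```

In this example three pairs clear the threshold, two of them correctly, and one true match falls below it, so precision and recall are both 2/3 while the base rate is 0.5. Raising the threshold generally trades recall for precision.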

Clustering

Terms:

  • Cluster: a group of records all referring to the same entity (person). Every record in a cluster will get the same person identifier.

  • Invalid links: record pairs that would have been clustered together based on P(match), but were not because linking them would have violated an edge constraint

  • Invalid clusters: record pairs that would have been clustered together based on P(match), but were not because merging them would have violated a cluster constraint

  • Singleton clusters: records that did not match any other record (so are now in a cluster by themselves)

Metrics:

  • Number of invalid predicted links skipped over during record linkage

  • Number of times an invalid cluster was prevented during clustering

  • Number of merges, or number of predicted links that were valid and produced valid clusters

  • Number of singleton clusters in the final set of clusters

  • Number of clusters, or people, discovered across the input dataset(s)
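The sketch below shows one way counts like these can arise. It is a simplified greedy procedure written for illustration, not Name Match's actual clustering algorithm: the function, the order in which links are processed, and both example constraints (based on a hypothetical "uid" field) are assumptions. In this sketch a merge is counted only when a valid link joins two different clusters.

```python
def cluster_records(records, links, threshold, edge_ok, cluster_ok):
    """Greedy constrained clustering sketch.

    records: dict of record id -> record
    links: iterable of (id_a, id_b, phat)
    edge_ok(rec_a, rec_b) / cluster_ok(list_of_records): hypothetical
    constraint checks supplied by the caller.
    """
    clusters = {rid: {rid} for rid in records}  # cluster label -> member ids
    label = {rid: rid for rid in records}       # record id -> cluster label
    counts = {"invalid_links": 0, "invalid_clusters": 0, "merges": 0}

    predicted = [link for link in links if link[2] >= threshold]
    for a, b, _phat in sorted(predicted, key=lambda l: l[2], reverse=True):
        if not edge_ok(records[a], records[b]):
            counts["invalid_links"] += 1        # edge constraint violation
            continue
        la, lb = label[a], label[b]
        if la == lb:
            continue                            # already in the same cluster
        merged = clusters[la] | clusters[lb]
        if not cluster_ok([records[r] for r in merged]):
            counts["invalid_clusters"] += 1     # cluster constraint violation
            continue
        for r in clusters[lb]:
            label[r] = la
        clusters[la] = merged
        del clusters[lb]
        counts["merges"] += 1

    counts["singletons"] = sum(1 for m in clusters.values() if len(m) == 1)
    counts["clusters"] = len(clusters)
    return clusters, counts


# Hypothetical constraints: records with different known ids may not be
# linked, and a cluster may contain at most one distinct known id.
def edge_ok(rec_a, rec_b):
    return None in (rec_a["uid"], rec_b["uid"]) or rec_a["uid"] == rec_b["uid"]


def cluster_ok(recs):
    return len({r["uid"] for r in recs if r["uid"] is not None}) <= 1


example_records = {1: {"uid": "A"}, 2: {"uid": None}, 3: {"uid": "B"},
                   4: {"uid": None}, 5: {"uid": None}}
example_links = [(1, 2, 0.9), (2, 3, 0.8), (1, 3, 0.7), (3, 4, 0.6), (4, 5, 0.3)]

final_clusters, counts = cluster_records(
    example_records, example_links, 0.5, edge_ok, cluster_ok)
print(counts)
```

Here the link between records 1 and 3 is skipped as an invalid link, the link between records 2 and 3 is rejected because it would create an invalid cluster, two merges succeed, and record 5 ends up as the only singleton among three final clusters.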