API Reference
namematch.namematcher
- class namematch.namematcher.NameMatcher(config: dict | None = None, default_params: dict | None = None, og_blocking_index_file: str = 'None', trained_model_info_file: str = 'None', nm_info_file: str = 'nm_info.yaml', log_file_name: str | None = None, logging_params_file: str | None = None, output_dir: str = 'output', output_temp_dir: str | None = None, all_names_file: str = 'all_names.parquet', must_links: str = 'must_links.parquet', blocking_index_bin_file: str = 'blocking_index.bin', candidate_pairs_file: str = 'candidate_pairs.parquet', data_rows_dir: str = 'data_rows', selection_model_name: str = 'basic_selection_model.pkl', match_model_name: str = 'basic_match_model.pkl', flipped0_file: str = 'flipped0_potential_links.csv', model_dir: str = 'model', model_info_file: str = 'model.yaml', potential_edges_dir: str = 'potential_links', cluster_assignments: str = 'cluster_assignments.pkl', edges_to_cluster: str = 'edges_to_cluster.parquet', constraints: str | Constraints | None = None, an_output_file: str = 'all_names_with_clusterid.csv', report_file: str = 'matching_report.html', enable_lprof: bool = False, logging_level: str = 'INFO', params=None, schema=None)[source]
Bases:
object
Main interface to run all the steps in namematch
- property process_input_data
- property generate_must_links
- property block
- property generate_data_rows
- property fit_model
- property predict
- property cluster
- property generate_output
- property generate_report
- property all_tasks
- property nm_metadata
Namematch state including all the necessary attributes to recreate the NameMatcher object
- run(force=False, write_params_schema_file=True, write_stats_file=True)[source]
Main method to kick off the namematch process
- Parameters:
force (bool) – force all tasks to run, even those that have already completed
write_params_schema_file (bool) – whether to write params and schema to yaml
write_stats_file (bool) – whether to write the nm_info file
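A minimal usage sketch is shown below; the config contents and output directory are illustrative placeholders, not a complete or required configuration.

```python
from namematch.namematcher import NameMatcher

# Placeholder config: in practice this dict holds the data_files and
# variables sections described in the configuration documentation.
config = {
    # "data_files": {...},
    # "variables": [...],
}

nm = NameMatcher(config=config, output_dir="nm_output")
nm.run()  # runs every task, from process_input_data through generate_report
```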
- classmethod load_namematcher(nm_info_file_path, new_nm_info_file=None, **kwargs)[source]
Load a NameMatcher instance from an existing nm_info_file. This classmethod lets a run pick up where it left off: it creates a NameMatcher instance and restores its attributes, as well as the stats_dict for tasks that have already run (see the sketch below).
- Parameters:
nm_info_file_path (str) – path to the nm_info.yaml file
- Returns:
NameMatcher instance
- Return type:
NameMatcher
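A hedged sketch of resuming a run; the yaml path assumes the default file name and that it was written inside the run's output directory.

```python
from namematch.namematcher import NameMatcher

# Recreate the NameMatcher from a previous run's state file and continue.
nm = NameMatcher.load_namematcher("output/nm_info.yaml")
nm.run()  # tasks recorded as complete are skipped unless force=True
```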
namematch.data_structures.parameters
- class namematch.data_structures.parameters.Parameters(validate_param_dict)[source]
Bases:
object
Class that houses important matching parameters. Handles validation of the config file.
- static check_integrity(defaults, param, param_value)[source]
Ensure that parameters are of the appropriate type.
- Parameters:
defaults (dict) – dictionary with default parameter values
param (str) – parameter name (key)
param_value – value of the given key (parameter name)
- classmethod init(config: dict, defaults: dict)[source]
Create a Parameters instance.
- Parameters:
config (dict) – dictionary with match parameter values
defaults (dict) – dictionary with default params
- Returns:
instance of the Parameters class
- Return type:
Parameters
- classmethod load(filepath)[source]
Load a Parameters instance.
- Parameters:
filepath (str) – path to a yaml version of a Parameters instance
- Returns:
instance of the Parameters class
- Return type:
Parameters
- check_for_required_variables(variables)[source]
Validate that the config includes required variables.
- validate_exactmatch_variables(variables)[source]
Validate the exact_match_variables and negate_exact_match_variables parameters.
- validate_blocking_scheme(variables)[source]
Validate that the blocking scheme is in the correct format and provides the minimum number of blocking variables per blocking type (cosine_distance, edit_distance, absvalue_distance).
- get_blocking_variables()[source]
Get list of blocking variable nicknames.
- Returns:
list of variable nicknames (all-names columns) to use for blocking
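For illustration, a sketch of building a Parameters object from yaml files; the file names here are assumptions, and the required config/defaults keys are described in the configuration documentation.

```python
import yaml
from namematch.data_structures.parameters import Parameters

with open("config.yaml") as f:        # user-supplied match config (assumed path)
    config = yaml.safe_load(f)
with open("defaults.yaml") as f:      # package default parameter values (assumed path)
    defaults = yaml.safe_load(f)

params = Parameters.init(config=config, defaults=defaults)
blocking_vars = params.get_blocking_variables()  # nicknames of the blocking columns
```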
namematch.data_structures.schema
- class namematch.data_structures.schema.Schema(data_files, variables)[source]
Bases:
object
Class that houses the most essential instructions for how to complete the match: what data files to match, and which variables to use to do so.
- classmethod init(config, params)[source]
Create and validate a DataFileList instance and a VariableList instance.
- Parameters:
config (dict) – dictionary with match parameter values
params (dict) – dictionary with processed match parameter values
- Returns:
instance of the Schema class
- Return type:
Schema
namematch.data_structures.data_file
- class namematch.data_structures.data_file.DataFile(validated_data_file_dict)[source]
Bases:
object
Parent class for NewDataFile and ExistingDataFile, which house details about the data files input for matching.
- classmethod load(data_file_dict)[source]
Load a DataFile instance (either a NewDataFile or an ExistingDataFile).
- Parameters:
data_file_dict (dict) – dictionary-version of a DataFile object
- Returns:
instance of the DataFile class
- class namematch.data_structures.data_file.NewDataFile(validated_data_file_dict)[source]
Bases:
DataFile
- class namematch.data_structures.data_file.ExistingDataFile(validated_data_file_dict)[source]
Bases:
DataFile
- class namematch.data_structures.data_file.DataFileList(data_files_dict)[source]
Bases:
object
Class that houses a list of DataFile objects (either NewDataFiles or ExistingDataFiles).
- classmethod build(data_files_dict, existing_data_files_dict)[source]
Create a DataFileList instance.
- Parameters:
data_files_dict (dict) – dictionary with “new data file” info from user-input config
existing_data_files_dict (dict) – dictionary with “existing data file” info from user-input config
- Returns:
instance of the DataFileList class
- Return type:
DataFileList
- classmethod load(data_files_list_dict)[source]
Load a DataFileList instance.
- Parameters:
data_files_list_dict (dict) – dictionary-version of a DataFileList object
- Returns:
instance of the DataFileList class
- Return type:
DataFileList
- get_all_nicknames()[source]
Return a list of all of the DataFile nicknames in the DataFileList.
- Returns:
list of strings
- validate()[source]
Validate the DataFileList by validating the list overall and then validating each individual DataFile.
- validate_names()[source]
Validate that the DataFiles in the DataFileList all have unique nicknames and that the output file stems have unique cluster types.
namematch.data_structures.variable
- class namematch.data_structures.variable.Variable(validated_variable_dict)[source]
Bases:
object
- classmethod build(variable_dict, params)[source]
Create a Variable instance.
- Parameters:
variable_dict (dict) – info about a variable definition from user-input config
params (dict) – dictionary with processed match parameter values
- Returns:
instance of the Variable class
- Return type:
Variable
- validate_col_parameters(data_files)[source]
Validate that each data file has a corresponding “_col” parameter in each variable definition.
- Parameters:
data_files (namematch.data_structures.data_file.DataFileList) – info about what input files are being matched
- get_columns_to_read(file_nickname)[source]
Get the name(s) of the column(s) from the input file that need to be read in order to create the current all-names column.
- Parameters:
file_nickname (str) – nickname of input file being searched
- Returns:
list of column names
- class namematch.data_structures.variable.VariableList(variable_list)[source]
Bases:
object
Class that houses a list of Variable objects.
- classmethod build(variable_dict_list, params)[source]
Create a VariableList instance.
- Parameters:
variable_dict_list (dict) – dictionary with variable info from user-input config
params (dict) – dictionary with processed match parameter values
- Returns:
instance of the VariableList class
- Return type:
VariableList
- classmethod load(variables_list_dict)[source]
Load a VariableList instance.
- Parameters:
variables_list_dict (dict) – dictionary version of a VariableList instance
- Returns:
instance of the VariableList class
- Return type:
VariableList
- validate_col_parameters(data_files)[source]
Validate that the “_col” variables referenced in the config’s variable definitions actually exist in the input datasets.
- Parameters:
data_files (namematch.data_structures.data_file.DataFileList) – info about what input files are being matched
- validate_variable_names()[source]
Validate that the Variables in the VariableList all have unique nicknames.
- validate_type_counts(incremental)[source]
Validate that there is exactly one variable with compare type UniqueID. If incremental, validate that there is exactly one variable with compare type ExistingID.
- Parameters:
incremental (bool) – True if the config file provides “existing” data files
- validate(data_files)[source]
Validate several components of the variables defined in the config.
- Parameters:
data_files (namematch.data_structures.data_file.DataFileList) – info about what input files are being matched
- get_variables_where(attr, attr_value, equality_type='equals', return_type='name')[source]
Select variables that meet a certain condition (e.g. compare_type == ‘Category’).
- Parameters:
attr (str) – variable feature to condition on
attr_value (str) – acceptable values for the variable feature
equality_type (str) – check conditions using either “equals” or “in”
return_type (str) – either “name” or “ix”, determining what return type to use
- Returns:
list of variable nicknames (all-names columns) or corresponding all-names column indices
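For example (a sketch assuming a built VariableList called `variables`; the compare-type values shown are illustrative):

```python
# Nicknames of all variables whose compare_type equals "Category"
category_cols = variables.get_variables_where(
    attr="compare_type", attr_value="Category")

# All-names column indices for variables whose compare_type is in a set of values
string_ixs = variables.get_variables_where(
    attr="compare_type", attr_value=["String", "Category"],
    equality_type="in", return_type="ix")
```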
- get_columns_to_read(data_file)[source]
Get the name(s) of the column(s) from the input file that need to be read in order to create the all-names file.
- Parameters:
data_file (DataFile object) – contains info about a given input file
- Returns:
list of column names
- get_an_column_names()[source]
Get the final list of all-names columns, including internally created columns like file_type and drop_from_nm.
- Returns:
list of all-names columns
namematch.process_input_data
- class namematch.process_input_data.ProcessInputData(params, schema, all_names_file='all_names.parquet', *args, **kwargs)[source]
Bases:
NamematchBase
- property output_files
- main(**kw)[source]
Follow the instructions in the schema and params objects to build the all-names file from the raw input file(s).
- Parameters:
params (Parameters object) – contains parameter values
schema (Schema object) – contains match schema info (files to match, variables to use, etc.)
all_names_file (str) – path to the all-names file
- process_geo_column(df, variable)[source]
Take dataframe of geographic data (either in “lat,lon” format or in “lat”, “lon” format) and ensure it has just one column.
- Parameters:
df (pd.DataFrame) – df of address input data (columns are strings)
variable (Variable object) – contains naming info for new geo column
- Returns:
DataFrame of clean geographic information for all_names file
- Return type:
pd.DataFrame
- parse_address(address)[source]
Parse an address string into distinct parts.
- Parameters:
address (str) – string of full address (e.g. 54 East 18th Rd.)
- Returns:
(address number, street name, street suffix)
- Return type:
tuple
- process_address_column(df, logger=None)[source]
Take dataframe of address data (either in “123 Main St.” format or “123”, “Main”, “St.” format, order matters) and parse as needed to produce three clean columns: street number, street name, and street type.
- Parameters:
df (pd.DataFrame) – df of address input data
- Returns:
DataFrame of clean address information for all_names file
- Return type:
pd.DataFrame
- process_check(s, variable)[source]
Check the validity of the values in a given all-names column (according to the data type and config instructions) and set the series name correctly.
- Parameters:
s (pd.Series) – series to process (will be an all-names column)
variable (Variable object) – contains info on how to validate data in series
- Returns:
Processed series
- Return type:
pd.Series
- process_data(df, variables, data_file, params)[source]
Read in part of an input file and process it according to the config in order to create part of the all-names file.
- Parameters:
df (pd.DataFrame) – chunk of an input data file
variables (VariableList object) – contains info about the fields for matching (from config)
data_file (DataFile object) – contains info about the input data set
params (dict) – dictionary of param values
- Returns:
a chunk of the all-names table (one row per input record)
record_id
unique record identifier
file_type
either “new” or “existing”
<fields for matching>
both for the matching model and for constraint checking
<raw name fields>
pre-cleaning version of first and last name
blockstring
concatenated version of blocking columns (sep by ::)
drop_from_nm
flag, 1 if met any “to drop” criteria 0 otherwise
- Return type:
pd.DataFrame
- namematch.process_input_data.process_set_missing(s, set_missing_list)[source]
Set values in a series to missing as needed.
- Parameters:
s (pd.Series) – strings to process
set_missing_list (list) – list of strings that are disallowed
- Returns:
Processed series
- Return type:
pd.Series
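The underlying idea is simply replacing disallowed values with a missing indicator; a standalone sketch of that logic (illustrative only, not the package's exact implementation, which may use a different missing representation):

```python
import pandas as pd

def set_missing_sketch(s: pd.Series, set_missing_list: list) -> pd.Series:
    # Keep allowed values; replace disallowed ones with the empty string.
    return s.where(~s.isin(set_missing_list), "")

names = pd.Series(["JOHN", "UNKNOWN", "REFUSED", "MARY"])
cleaned = set_missing_sketch(names, ["UNKNOWN", "REFUSED"])  # JOHN, "", "", MARY
```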
- namematch.process_input_data.process_drop(s, drop_list)[source]
Get the records in a series that have invalid values.
- Parameters:
s (pd.Series) – series being processed
drop_list (list of str) – invalid values
- Returns:
Indices of records that are not valid
- Return type:
list
- namematch.process_input_data.process_auto_drops(an, existing_drop_list, drop_logic)[source]
Get the records in all-names that have invalid values due to a combination of multiple columns (based on logic in the private config).
- Parameters:
an (pd.DataFrame) – all-names chunk being processed
existing_drop_list (list of str) – records already known to be invalid
drop_logic (list of dicts) – logic for what makes a record invalid
- Returns:
Indices of records that are not valid
- Return type:
list
namematch.generate_must_links
- class namematch.generate_must_links.GenerateMustLinks(params, schema, all_names_file, must_links, *args, **kwargs)[source]
Bases:
NamematchBase
- property output_files
- main(**kw)[source]
Generate the list of must-link pairs using UniqueID info.
- Parameters:
params (Parameters object) – contains parameter values
schema (Schema object) – contains match schema info (files to match, variables to use, etc.)
all_names_file (str) – path to the all-names file
must_links (str) – path to the must-links file
- build_ml_var_df(all_names_file, uid_vars_list, **kw)[source]
Load the all-names file and limit it to the rows that have a non-missing UniqueID value.
- Parameters:
all_names_file (str) – path to the all-names file
uid_vars_list (list of strings) – all-name columns with compare_type “UniqueID”
- Returns:
a subset of the all-names file, relevant columns only
record_id
unique record identifier
blockstring
concatenation of all the blocking variables (sep by ::)
drop_from_nm
flag, 1 if met any “to drop” criteria 0 otherwise
new_record
either True or False
<UniqueID column(s)>
variables of compare_type UniqueID
- Return type:
pd.DataFrame
- get_must_links(ml_var_df, uid_vars_list, **kw)[source]
Expand the list of records with must-link information to pairs of records that must be linked together in the final match.
- Parameters:
ml_var_df (pd.DataFrame) –
record_id
unique record identifier
blockstring
concatenation of all the blocking variables (sep by ::)
drop_from_nm
flag, 1 if met any “to drop” criteria 0 otherwise
new_record
either True or False
<UniqueID column(s)>
variables of compare_type UniqueID
uid_vars_list (list of strings) – all-name columns with compare_type “UniqueID”
- Returns:
list of must-link record pairs
record_id_1
unique identifier for the first record in the pair
record_id_2
unique identifier for the second record in the pair
blockstring_1
blockstring for the first record in the pair
blockstring_2
blockstring for the second record in the pair
drop_from_nm_1
flag, True if the first record in the pair was not eligible for matching
drop_from_nm_2
flag, True if the second record in the pair was not eligible for matching
- Return type:
pd.DataFrame
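Conceptually, every group of records sharing a non-missing UniqueID value is expanded into all pairwise combinations; a simplified sketch of that expansion (column handling and edge cases omitted):

```python
from itertools import combinations
import pandas as pd

def expand_to_pairs_sketch(ml_var_df: pd.DataFrame, uid_col: str) -> pd.DataFrame:
    # For each UniqueID value, emit every pair of record_ids that share it.
    pairs = []
    for _, grp in ml_var_df.dropna(subset=[uid_col]).groupby(uid_col):
        for r1, r2 in combinations(grp["record_id"], 2):
            pairs.append({"record_id_1": r1, "record_id_2": r2})
    return pd.DataFrame(pairs, columns=["record_id_1", "record_id_2"])
```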
namematch.block
- class namematch.block.Block(params, schema, all_names_file='all_names.parquet', must_links_file='must_links.parquet', candidate_pairs_file='candidate_pairs.parquet', blocking_index_bin_file='blocking_index.bin', og_blocking_index_file='None', *args, **kwargs)[source]
Bases:
NamematchBase
- Parameters:
params (Parameters object) – contains matching parameter values
schema (Schema object) – contains match schema info (files to match, variables to use, etc.)
all_names_file (str) – path to the all-names file
must_links_file (str) – path to the must-links file
blocking_index_bin_file (str) – name of the blocking index file
og_blocking_index_file (str) – path to a pre-built nmslib index (optional, if doesn’t exist then None)
candidate_pairs_file (str) – path to the candidate-pairs file
- property output_files
- main(**kw)[source]
Generate the candidate-pairs list using the blocking scheme outlined in the config.
- split_last_names(df, last_name_column, blocking_scheme, **kw)[source]
Expand the processed all-names file to handle double last names (e.g. SAM SMITH-BROWN becomes SAM SMITH and SAM BROWN).
- Parameters:
df (pd.DataFrame) – all-names table, relevant columns only (where drop_from_nm == 0)
last_name_column (str) – clean last name column
blocking_scheme (dict) – dictionary with info on how to do blocking
- Returns:
more rows than input all names, plus orig_last_name and orig_record columns
- Return type:
pd.DataFrame
- convert_all_names_to_blockstring_info(an, absval_col, params, **kw)[source]
Create a table with information about blockstrings. If the split_names parameter is True, then this function expands double last names to create two new “records” (e.g. SAM SMITH-BROWN becomes SAM SMITH and SAM BROWN).
- Parameters:
an (pd.DataFrame) –
all-names table, relevant columns only (where drop_from_nm == 0)
record_id
unique record identifier
blockstring
concatenation of all the blocking variables (sep by ::)
file_type
either “new” or “existing”
drop_from_nm
flag, 1 if met any “to drop” criteria 0 otherwise
<nn-blocking column(s)>
variables for near-neighbor blocking
<ed-blocking column>
variable for edit-distance blocking
<av-blocking column>
(optional) variable for abs-value blocking
nn_string
concatenated version of nn-blocking columns (sep by ::)
ed_string
copy of ed-blocking column
absval_string
copy of abs-value-blocking column
absval_col (str) – column for absolute-value blocking
params (Parameter object) – contains matching parameters
- Returns:
tuple containing:
nn_string_info (pd.DataFrame): table with one row per nn_string (or expanded nn_string)
nn_string
concatenated version of nn-blocking columns (sep by ::)
commonness_penalty
float indicating how common the last name is
n_new
number of times this nn_string appears in a “new” record
n_existing
number of times this nn_string appears in an “existing” record
n_total
number of times this nn_string appears in any record
nn_string_expanded_df (pd.DataFrame): table with one row per blockstring (or expanded blockstring)
nn_string
concatenated version of nn-blocking columns (sep by ::)
nn_string_full
(optional) if split_names is True, this is the full (un-split) nn_string
ed_string
copy of ed-blocking column
absval_string
copy of abs-value-blocking column
- Return type:
tuple
- get_query_strings(nn_string_info, blocking_scheme)[source]
Filter to nn_strings that appear in the new data – these are the only strings for which we need near-neighbors. If incremental is False, this filtering step does nothing.
- Parameters:
nn_string_info (pd.DataFrame) –
table with one row per nn_string (or expanded nn_string)
nn_string
concatenated version of nn-blocking columns (sep by ::)
commonness_penalty
float indicating how common the last name is
n_new
number of times this nn_string appears in a “new” record
n_existing
number of times this nn_string appears in an “existing” record
n_total
number of times this nn_string appears in any record
blocking_scheme (dict) – dictionary with info on how to do blocking
- Returns:
tuple containing:
nn_string_info_to_query (pd.DataFrame): nn_string_info, subset to nn_strings where n_new > 0
nn_strings_to_query (list): nn_strings that appear at least once in a “new” record
shingles_to_query (scipy.sparse.csr_matrix): sparse weighted shingles matrix for the nn_strings that appear in a new record
- Return type:
tuple
- generate_shingles_matrix(nn_strings, alpha, power, matrix_type, verbose=True, **kw)[source]
Return a weighted sparse matrix of 2-shingles.
- Parameters:
nn_strings (list) – strings of the form ‘FIRST::LAST’ to shingle and put in the matrix (rows)
alpha (float) – weight of LAST relative to FIRST
power (float) – parameter controlling the impact of name length on cosine distance
matrix_type (str) – description of matrix being built (for logging)
verbose (bool) – True if status messages desired
- Returns:
Weighted sparse 2-shingles matrix
- Return type:
scipy.sparse.csr_matrix
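As a simplified illustration of the 2-shingle representation, here is a sketch using scikit-learn's character n-gram vectorizer; the alpha (last-name weight) and power (name-length) adjustments applied by the real matrix are omitted:

```python
from sklearn.feature_extraction.text import CountVectorizer

nn_strings = ["JOHN::SMITH", "JON::SMYTH", "MARY::HANDA"]

# Character 2-shingles over the concatenated first/last name strings.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
shingles = vectorizer.fit_transform(nn_strings)  # scipy.sparse CSR matrix, one row per name
```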
- load_main_index(index_file, **kw)[source]
Load the main index, which is reusable over time as data is added incrementally.
- Parameters:
index_file (str) – path to stored index
- Returns:
nmslib index object
- Return type:
nmslib.FloatIndex
- generate_index(nn_strings, num_workers, M, efC, post, alpha, power, print_progress=True, **kw)[source]
Build an nmslib index based on a list of nn_strings and a set of parameters.
- Parameters:
nn_strings (list) – strings of the form ‘FIRST::LAST’ to shingle and put in matrix (rows)
num_workers (int) – number of threads nmslib should use when parallelizing
M – nmslib index parameter
efC – nmslib index parameter
post – nmslib index parameter
alpha (float) – weight of last-name relative to first-name
power (float) – parameter controlling the impact of name length on cosine distance
print_progress (bool) – controls verbosity of index creation
- Returns:
nmslib index object
- Return type:
nmslib.FloatIndex
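A rough sketch of the nmslib workflow this wraps, assuming a sparse cosine space over the shingles matrix from the previous sketch; the parameter values are illustrative and in practice come from the config:

```python
import nmslib

index = nmslib.init(method="hnsw", space="cosinesimil_sparse",
                    data_type=nmslib.DataType.SPARSE_VECTOR)
index.addDataPointBatch(shingles)  # CSR matrix of weighted 2-shingles
index.createIndex({"M": 100, "efConstruction": 1000, "post": 0}, print_progress=True)

# Query the k nearest neighbors for a batch of query rows (also a CSR matrix).
neighbors = index.knnQueryBatch(shingles, k=25, num_threads=4)
```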
- get_indices(params, all_nn_strings, og_blocking_index_file, **kw)[source]
Wrapper function coordinating the creation and/or loading of the nmslib indices.
- Parameters:
params (Parameters object) – contains matching parameter values
all_nn_strings – list of all unique nn_strings in the data (expanded if split_names is True)
og_blocking_index_file (str) – path to a pre-built nmslib index (optional, if doesn’t exist then None)
- Returns:
tuple containing:
main_index (nmslib.FloatIndex): the main nmslib index
main_index_nn_strings (list): nn_strings that are in the main nmslib index
second_index (nmslib.FloatIndex): the secondary nmslib index for querying new nn_strings during incremental runs (often None)
second_index_nn_strings (list): nn_strings that are in the secondary nmslib index (often None)
- Return type:
tuple
- generate_candidate_pairs(nn_strings_to_query, shingles_to_query, nn_string_info, nn_string_expanded_df, main_index, main_index_nn_strings, second_index, second_index_nn_strings, batch_size, **kw)[source]
Wrapper function for querying the nmslib index (or indices) and getting non-matching candidate pairs.
- Parameters:
nn_strings_to_query (list) – nn_strings in new data – those that need near neighbors
shingles_to_query (csr_matrix) – shingles matrix for nn_strings_to_query
nn_string_info (pd.DataFrame) – table with one row per nn_string (or expanded nn_string)
nn_string_expanded_df (pd.DataFrame) – maps a nn_string to a ed_string and absval_string
main_index (nmslib index) – the main nmslib index for querying
main_index_nn_strings (list) – nn_strings in main_index
second_index (nmslib index) – the secondary nmslib index, for some incremental runs
second_index_nn_strings (list) – nn_strings in second_index
batch_size (int) – batch size; defaults to 10000 and can be modified in the config.yaml file
- Returns:
candidate-pairs list, before adding in uncovered pairs
blockstring_1
concatenated version of blocking columns for first element in pair (sep by ::)
blockstring_2
concatenated version of blocking columns for second element in pair (sep by ::)
cos_dist
approximate cosine distance between two nn_strings (nmslib)
edit_dist
number of character edits between ed-strings
covered_pair
flag; 1 for pairs that made it through blocking, 0 otherwise; all 1s here
- Return type:
pd.DataFrame
- compute_cosine_sim(blockstrings_in_pairs, pairs_df, shingles_matrix, **kw)[source]
Fast cosine similarity computation using the shingles matrix.
- Parameters:
blockstrings_in_pairs (list) – used to get index of different strings in shingles_matrix
pairs_df (pd.DataFrame) –
blockstrings you want cosine distance between
blockstring_1
blockstring for the first record in the pair
blockstring_2
blockstring for the second record in the pair
covered_pair
flag, 1 if covered 0 otherwise
nn_strings_1
nn_string for the first record in the pair
nn_strings_2
nn_string for the second record in the pair
both_nn_strings
nn_string_1 + ‘ ‘ + nn_string_2
shingles_matrix (csr_matrix) – weighted shingles matrix
- Returns:
cosine distance between the two strings in each pair
- evaluate_blocking(cp_df, tp_df, blocking_scheme, **kw)[source]
The evaluate_blocking function computes the pair completeness metrics to determine how successful blocking was at minimizing comparisons and maximizing true positives (i.e. generating a candidate pair between records that are actually matches).
- Parameters:
cp_df (pd.DataFrame) – candidate pairs df
tp_df (pd.DataFrame) – true pairs df (blockstring_1, blockstring_2)
blocking_scheme (dict) – dictionary with info on how to do blocking
- Returns:
portion of candidate-pairs dataframe where covered == 0
- Return type:
pd.DataFrame
- add_uncovered_pairs(candidate_pairs_df, uncovered_pairs_df)[source]
Add the uncovered pairs to the candidate pairs dataframe so that all of the known pairs are in the candidate pairs list.
- Parameters:
candidate_pairs_df (pd.DataFrame) – candidate pairs file produced by blocking
uncovered_pairs_df (pd.DataFrame) – uncovered pairs produced by evaluating blocking
- Returns:
candidate-pairs file
blockstring_1
concatenated version of blocking columns for first element in pair (sep by ::)
blockstring_2
concatenated version of blocking columns for second element in pair (sep by ::)
cos_dist
approximate cosine distance between two nn_strings (nmslib)
edit_dist
number of character edits between ed-strings
covered_pair
flag; 1 for pairs that made it through blocking, 0 otherwise
- Return type:
pd.DataFrame
- apply_blocking_filter(df, thresholds, nn_string_expanded_df, nns_match=False)[source]
Compare similarity of names and DOBs to see if a pair of records are likely to be a match.
- Parameters:
df (pd.DataFrame) –
holds similarity and commonness info about pairs of names
nn_string_1
concatenated version of nn-blocking columns for first element in pair (sep by ::)
nn_string_2
concatenated version of nn-blocking columns for second element in pair (sep by ::)
cos_dist
approximate cosine distance between two nn_strings (nmslib)
commonness_penalty_1
penalty for last-name commonness for first element in pair
commonness_penalty_2
penalty for last-name commonness for second element in pair
thresholds (dict) – information about what blocking distances are allowed
nn_string_expanded_df (pd.DataFrame) – maps a nn_string to a ed_string and absval_string
nns_match (bool) – True if this function is called by get_exact_match_candidate_pairs
- Returns:
chunk of the candidate-pairs list
blockstring_1
concatenated version of blocking columns for first element in pair (sep by ::)
blockstring_2
concatenated version of blocking columns for second element in pair (sep by ::)
cos_dist
approximate cosine distance between two nn_strings (nmslib)
edit_dist
number of character edits between ed-strings
covered_pair
flag; 1 for pairs that made it through blocking, 0 otherwise; all 1s here
- Return type:
pd.DataFrame
- disallow_switched_pairs(df, incremental, nn_strings_to_query)[source]
Look through the columns nn_string_1 and nn_string_2 and keep only rows where nn_string_1 <= nn_string_2 to prevent duplicates in the end (i.e. ABBY->ZABBY & ZABBY->ABBY; only one is needed). Special case for incremental runs.
- Parameters:
df (pd.DataFrame) – holds similarity and commonness info about pairs of names
incremental (bool) – True if the current run is incremental
nn_strings_to_query (list) – nn_strings that are in “to query” list
- Returns:
same as input df, but no AB/BA duplicates
- Return type:
pd.DataFrame
- get_actual_candidates(near_neighbors_df, nn_string_expanded_df, nn_strings_to_query, thresholds, incremental, output=None)[source]
Actually determines whether two names become candidates; this function is launched by generate_candidate_pairs() and run on individual worker threads to speed up processing.
- Parameters:
near_neighbors_df (pd.DataFrame) –
holds similarity and commonness info about pairs of names
nn_string_ix
a string with nn_string_ix = i is the string located at nn_strings_queried_this_batch[i]
nn_string_1
concatenated version of nn-blocking columns for first element in pair (sep by ::)
nn_string_2
concatenated version of nn-blocking columns for second element in pair (sep by ::)
cos_dist
approximate cosine distance between two nn_strings (nmslib)
commonness_penalty_1
penalty for last-name commonness for first element in pair
commonness_penalty_2
penalty for last-name commonness for second element in pair
nn_string_expanded_df (pandas dataframe) – table at nn_string/ed_string/absval_string level (expanded if split_name is True)
nn_strings_to_query (list) – nn_strings in the “to query” list (needed for incremental check)
thresholds (dict) – information about what blocking distances are allowed
incremental (bool) – True if the current run is incremental
output – None if the output should be returned, rather than written
- Returns:
chunk of the candidate-pairs list
blockstring_1
concatenated version of blocking columns for first element in pair (sep by ::)
blockstring_2
concatenated version of blocking columns for second element in pair (sep by ::)
cos_dist
approximate cosine distance between two nn_strings (nmslib)
edit_dist
number of character edits between ed-strings
covered_pair
flag; 1 for pairs that made it through blocking, 0 otherwise; all 1s here
- Return type:
pd.DataFrame
- get_near_neighbors_df(near_neighbors_list, nn_string_info, nn_strings_this_index, nn_strings_queried_this_batch)[source]
For a small batch of names (nn_strings_queried_this_batch), format a dataframe that enumerates every pair of (name in this batch, a near neighbor), along with information about similarity and commonness.
- Parameters:
near_neighbors_list (list) – list of (list of k IDs, list of k distances) tuples, of length batch_size
nn_string_info (pd.DataFrame) – table mapping nn_string to commonness_penalty
nn_strings_this_index (list) – nn_strings in the current index
nn_strings_queried_this_batch (list) – nn_strings in the current query batch (length batch_size), whose neighbors are stored in near_neighbors_list
- Returns:
holds similarity and commonness info about pairs of names
nn_string_ix
a string with nn_string_ix = i is the string located at nn_strings_queried_this_batch[i]
nn_string_1
concatenated version of nn-blocking columns for first element in pair (sep by ::)
nn_string_2
concatenated version of nn-blocking columns for second element in pair (sep by ::)
cos_dist
approximate cosine distance between two nn_strings (nmslib)
commonness_penalty_1
penalty for last-name commonness for first element in pair
commonness_penalty_2
penalty for last-name commonness for second element in pair
- Return type:
pd.DataFrame
- get_exact_match_candidate_pairs(nn_string_info_multi, nn_string_expanded_df, blocking_thresholds)[source]
All nn_strings that appear more than once need to have a corresponding nn_string, nn_string candidate pair – we can skip the “approximation” easily for this type of candidate pair.
- Parameters:
nn_string_info_multi (pd.DataFrame) – nn_string_info, subset to nn_strings with n_new > 0 & n_total > 1
nn_string_expanded_df (pd.DataFrame) – table at nn_string/ed_string/absval_string level (expanded if split_name is True)
blocking_thresholds (dict) – dictionary with thresholds for blocking, e.g. high and low bar
- Returns:
portion of the candidate pairs list (where nn_string_1 == nn_string_2)
nn_string
concatenated version of nn-blocking columns (sep by ::)
commonness_penalty
float indicating how common the last name is
n_new
number of times this nn_string appears in a “new” record
n_existing
number of times this nn_string appears in an “existing” record
n_total
number of times this nn_string appears in any record
- Return type:
pd.DataFrame
- namematch.block.get_blocking_columns(blocking_scheme)[source]
Get the list of blocking variables for each type of blocking.
- Parameters:
blocking_scheme (dict) – dictionary with info on how to do blocking
- Returns:
the variable names needed for each type of blocking
- Return type:
list of string lists
- namematch.block.read_an(an_file, nn_cols, ed_col, absval_col)[source]
Read in relevant columns for blocking from the all-names file.
- Parameters:
an_file (str) – path to the all-names file
nn_cols (list of strings) – variables for near neighbor blocking
ed_col (str) – variable for edit-distance blocking
absval_col (str) – variable for absolute-value blocking
- Returns:
all-names dataframe, relevant columns only (where drop_from_nm == 0)
record_id
unique record identifier
blockstring
concatenation of all the blocking variables (sep by ::)
file_type
either “new” or “existing”
drop_from_nm
flag, 1 if met any “to drop” criteria 0 otherwise
<nn-blocking column(s)>
variables for near-neighbor blocking
<ed-blocking column>
variable for edit-distance blocking
<av-blocking column>
(optional) variable for abs-value blocking
nn_string
concatenated version of nn-blocking columns (sep by ::)
ed_string
copy of ed-blocking column
absval_string
copy of abs-value-blocking column
- Return type:
pd.DataFrame
- namematch.block.get_nn_string_counts(an)[source]
Count the number of records per nn_string (per file_type).
- Parameters:
an (pd.DataFrame) –
all-names table, relevant columns only (where drop_from_nm == 0)
record_id
unique record identifier
blockstring
concatenation of all the blocking variables (sep by ::)
file_type
either “new” or “existing”
drop_from_nm
flag, 1 if met any “to drop” criteria 0 otherwise
<nn-blocking column(s)>
variables for near-neighbor blocking
<ed-blocking column>
variable for edit-distance blocking
<av-blocking column>
(optional) variable for abs-value blocking
nn_string
concatenated version of nn-blocking columns (sep by ::)
ed_string
copy of ed-blocking column
absval_string
copy of abs-value-blocking column
- Returns:
two keys (new and existing), mapping to a dictionary of nn_strings to n_records
- Return type:
dict
- namematch.block.get_common_name_penalties(clean_last_names, max_penalty, num_threshold_bins=1000)[source]
Create a dictionary mapping each last name to a “commonness penalty.” Two SMITHs are less likely to be the same person than two HANDAs, since SMITH is such a common name. This function quantifies that penalty for use in later blocking calculations. A more common name receives a higher number, topping out at max_penalty.
- Parameters:
clean_last_names (pd.Series) – clean (un-split) last name column (one row per record)
max_penalty (float) – the maximum penalty (for the most common names)
num_threshold_bins (int) – number of different categories of commonness to create
- Returns:
dictionary mapping name (str) to penalty (float)
- Return type:
dict
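A rough sketch of the frequency-based idea; the package's binning into num_threshold_bins categories is more involved than this rank-and-scale version:

```python
import pandas as pd

def commonness_penalty_sketch(clean_last_names: pd.Series, max_penalty: float) -> dict:
    # More frequent last names get penalties closer to max_penalty.
    freq = clean_last_names.value_counts()
    penalties = freq.rank(pct=True) * max_penalty
    return penalties.to_dict()

penalties = commonness_penalty_sketch(
    pd.Series(["SMITH", "SMITH", "SMITH", "HANDA"]), max_penalty=0.1)
```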
- namematch.block.get_all_shingles()[source]
Get all valid 2-shingles.
- Returns:
valid 2-shingles
- Return type:
list
- namematch.block.prep_index()[source]
Initialize index data structure, which will store similarity information about the names, and load processed shingles into it.
- Returns:
nmslib index object (pre time-consuming build call)
- Return type:
nmslib.FloatIndex
- namematch.block.get_second_index_nn_strings(all_nn_strings, main_nn_strings)[source]
Get nn_strings that haven’t already been stored in the main index.
- Parameters:
all_nn_strings (list) – list of all nn_strings in the data (expanded if split_names is True)
main_nn_strings (list) – list of nn_strings already in the main index
- Returns:
the nn_strings that are not in main_nn_strings
- Return type:
list
- namematch.block.save_main_index(main_index, main_index_nn_strings, main_index_file)[source]
Save the main nmslib index and pickle dump the associated nn_strings list.
- Parameters:
main_index (nmslib.FloatIndex) – the main, built nmslib index
main_index_nn_strings (list) – list of nn_strings in the main index
main_index_file (str) – path to store the main nmslib index
- namematch.block.load_main_index_nn_strings(og_blocking_index_file)[source]
Load the nn_strings that are in an existing nmslib index file.
- Parameters:
og_blocking_index_file (str) – path to original blocking index
- Returns:
loaded list of nn_strings in an existing nmslib index
- Return type:
list
- namematch.block.write_some_cps(cand_pairs, candidate_pairs_file)[source]
Write out a portion of the candidate-pairs to parquet.
- Parameters:
cand_pairs (pd.DataFrame) – chunk of the candidate-pairs file
candidate_pairs_file (str) – path to the candidate-pairs file
- namematch.block.generate_true_pairs(must_links_df)[source]
Reduce the must-link record pairs to must-link blockstring pairs.
- Parameters:
must_links_df (pd.DataFrame) –
list of must-link record pairs
record_id_1
unique identifier for the first record in the pair
record_id_2
unique identifier for the second record in the pair
blockstring_1
blockstring for the first record in the pair
blockstring_2
blockstring for the second record in the pair
drop_from_nm_1
flag, 1 if the first record in the pair was not eligible for matching
drop_from_nm_2
flag, 1 if the second record in the pair was not eligible for matching
existing
flag, 1 if the pair is must-link because of ExistingID
- Returns:
list of must-link blockstring pairs (where both records have drop_from_nm == 0)
blockstring_1
blockstring for the first record in the pair
blockstring_2
blockstring for the second record in the pair
- Return type:
pd.DataFrame
namematch.generate_data_rows
- class namematch.generate_data_rows.GenerateDataRows(params, schema, output_dir, all_names_file, candidate_pairs_file, *args, **kwargs)[source]
Bases:
NamematchBase
- property output_files
- main(**kw)[source]
Take candidate pairs and merge on the all-names records (twice) to get a dataset at the record pair level. Compute distance metrics between the records in the pair – these are the features for modeling.
- Parameters:
params (Parameters object) – contains parameter values
schema (Schema object) – contains match schema info (files to match, variables to use, etc.)
all_names_file (str) – path to the all-names file
candidate_pairs_file (str) – path to the candidate-pairs file
output_dir (str) – path to the data-rows dir
- generate_name_probabilities_object(an, fn_col=None, ln_col=None, **kw)[source]
The generate_name_probabilities function uses a list of names (from the all-names file) to create an object containing queryable probability information for each name.
- Parameters:
an (pd.DataFrame) – all-names, just the name columns
fn_col (str) – name of first name column
ln_col (str) – name of last name column
- Returns:
name probability object
- generate_actual_data_rows(params, schema, sbs_df, np_object, first_iter)[source]
Create modeling dataframe by comparing each variable (via numerous distance metrics).
- Parameters:
params (Parameters object) – contains matching parameters
schema (Schema object) – contains matching schema (data files and variables)
sbs_df (pd.DataFrame) –
side-by-side table (record pair level, with info from both an records)
record_id (_1, _2)
unique record identifier
blockstring (_1, _2)
concatenated version of blocking columns (sep by ::)
file_type (_1, _2)
either “new” or “existing”
candidate_pair_ix
index from candidate-pairs list
covered_pair
flag, 1 if blockstring pair passed blocking 0 otherwise
<fields for matching> (_1, _2)
both for the matching model and for constraint checking
np_object (nm_prob.NameProbability object) – contains information about name probabilities
- Returns:
chunk of the data-rows file
dr_id
unique record pair identifier (record_id_1__record_id_2)
record_id (_1, _2)
unique record identifiers
<distance metrics>
how similar the different matching fields are between records
label
”1” if the records refer to the same person, “0” if not, “” otherwise
- Return type:
pd.DataFrame
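To illustrate the kind of pairwise features produced, a standalone sketch that compares one field between the two records in each pair using a standard-library similarity ratio (the package computes a much richer set of distance metrics):

```python
from difflib import SequenceMatcher
import pandas as pd

def field_similarity_sketch(sbs_df: pd.DataFrame, field: str) -> pd.Series:
    # Similarity in [0, 1] between <field>_1 and <field>_2 for each candidate pair.
    return sbs_df.apply(
        lambda row: SequenceMatcher(None, row[f"{field}_1"], row[f"{field}_2"]).ratio(),
        axis=1)

pairs = pd.DataFrame({"first_name_1": ["JOHN"], "first_name_2": ["JON"]})
pairs["first_name_sim"] = field_similarity_sketch(pairs, "first_name")
```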
- generate_data_row_files(params, schema, an, cp_df, name_probs, start_ix_worker, end_ix_worker, dr_file, **kw)[source]
The generate_data_row_files function is run in parallel to generate the data needed for the random forest; it performs the merge between candidate pairs and all-names and calls the function that calculates distance metrics.
- Parameters:
params (Parameters object) – contains matching parameters
schema (Schema object) – contains matching schema (data files and variables)
an (pd.DataFrame) –
all-names table (one row per input record)
record_id
unique record identifier
file_type
either “new” or “existing”
<fields for matching>
both for the matching model and for constraint checking
<raw name fields>
pre-cleaning version of first and last name
blockstring
concatenated version of blocking columns (sep by ::)
drop_from_nm
flag, 1 if met any “to drop” criteria 0 otherwise
cp_df (pd.DataFrame) –
candidate-pairs list
blockstring_1
concatenated version of blocking columns for first element in pair (sep by ::)
blockstring_2
concatenated version of blocking columns for second element in pair (sep by ::)
covered_pair
flag; 1 for pairs that made it through blocking, 0 otherwise; all 1s here
name_probs (nm_prob.NameProbability object) – contains information about name probabilities
start_ix_worker (int) – starting index of the candidate-pairs chunk to read in this thread
end_ix_worker (int) – end index of the candidate-pairs chunk to read in this thread
dr_file (str) – path to data-rows file to write (one for each worker thread)
namematch.fit_model
- class namematch.fit_model.FitModel(params, all_names_file, data_rows_dir, model_info_file, output_dir, trained_model_info_file='None', selection_model_name='basic_selection_model.pkl', match_model_name='basic_match_model.pkl', flipped0_file=None, *args, **kwargs)[source]
Bases:
NamematchBase
- Parameters:
params (Parameters object) – contains parameter values
all_names_file (str) – path to the all-names file
data_rows_dir (str) – path to the data-rows dir
model_info_file (str) – path to the model info yaml file
output_dir (str) – path to the model dir
trained_model_info_file (str) – path to the model info yaml file of a previously trained model
selection_model_name (str) – selection model name
match_model_name (str) – match model name
flipped0_file (str) – flipped0 file path
- property output_files
- property dr_file_list
- main(**kw)[source]
Train and evaluate random forest model(s). Depending on the settings, this might involve training and evaluating multiple types of models (e.g. selection and match models) and/or models for different data-row types (e.g. basic and no-dob).
- fit_model(df, vars_to_exclude, outcome, weights=None, n_jobs=1, **kw)[source]
Fit random forest model.
- Parameters:
df (pd.DataFrame) – data rows, subset to training rows
vars_to_exclude (list) – variables to disallow from the model
outcome (string) – name of the column that we’re predicting
weights (list) – sample weights to use for training (can be None)
n_jobs (int) – number of jobs to run in parallel
- Returns:
tuple containing:
mod (sklearn.ensemble.RandomForestClassifier): trained sklearn random forest model object
feature_info (pd.DataFrame): feature_importance
- Return type:
tuple
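At its core this is a scikit-learn random forest fit on the labeled, training-eligible data rows; a stripped-down sketch follows (column names and hyperparameters are placeholders, and the package wraps the estimator in a preprocessing pipeline):

```python
from sklearn.ensemble import RandomForestClassifier

# `train_df` is assumed to be a data-rows DataFrame limited to training rows.
feature_cols = [c for c in train_df.columns
                if c not in ("dr_id", "record_id_1", "record_id_2", "label")]

rf = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=1)
rf.fit(train_df[feature_cols], train_df["label"], sample_weight=None)
```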
- fit_models(train_df, model_type, model_info)[source]
Fit random forest models.
- Parameters:
train_df (pd.DataFrame) – data rows, subset to training rows
model_type (string) – either “selection” or “match”
model_info (dict) – dict with information about how to fit the model
- Returns:
maps model name (e.g. basic or no_dob) to a trained model object
- Return type:
dict
- evaluate_models(phats_df, outcome, model_type, weight_using_selection_model=False, default_threshold=0.5, missingness_model_threshold_boost=0.2, optimize_threshold=False, fscore_beta=1.0, **kw)[source]
- get_train_eval_data(an_train_eligible_dict, model_info, params, model_type, any_train=True)[source]
Load data-rows, filter to rows that are eligible for training a given model type, and then split the data into a training set and a labeled evaluation set.
- Parameters:
an_train_eligible_dict (dict) – maps record_id to flag indicating record’s all-names based training eligibility
model_info (dict) – dict with information about how to fit the model
params (Parameters object) – contains parameter values
model_type (str) – either “selection” or “match”
any_train (bool) – True if you want training data (e.g. not a pre-trained model), False otherwise
- Returns:
tuple containing:
pd.DataFrame: data rows, filtered to training data (excluding labeled eval data)
pd.DataFrame: data rows, filtered to labeled eval data
float: share of data rows that are labeled
- Return type:
tuple
- find_valid_training_records(an, an_match_criteria)[source]
Identify records that meet the all-names criteria for training data.
- Parameters:
an (pd.DataFrame) –
all-names table (one row per input record)
record_id
unique record identifier
file_type
either “new” or “existing”
<fields for matching>
both for the matching model and for constraint checking
<raw name fields>
pre-cleaning version of first and last name
blockstring
concatenated version of blocking columns (sep by ::)
drop_from_nm
flag, 1 if met any “to drop” criteria 0 otherwise
an_match_criteria (dict) – keys are all-names columns, mapped to acceptable values
- Returns:
flag, 1 if the record is eligible for training set 0 otherwise
- Return type:
pd.Series
- namematch.fit_model.get_feature_info(pipeline, raw_num_cols, raw_cat_cols)[source]
Extract the feature importance information from a sklearn model pipeline.
- Parameters:
pipeline (sklearn fitted pipeline) – trained model
raw_num_cols (list) – numeric columns that went into the model (before pipeline processing)
raw_cat_cols (list) – categorical columns that went into the model (before pipeline processing)
- Returns:
feature importance information
feature
name of the feature
importance
relative importance of this feature to the model
- Return type:
pd.DataFrame
- namematch.fit_model.save_models(selection_models, match_models, model_info)[source]
Save the models to file.
- Parameters:
selection_models (dict) – maps model name (e.g. basic or no-dob) to a fit selection model object
match_models (dict) – maps model name (e.g. basic or no-dob) to a fit match model object
model_info (dict) – dict with information about how to fit the model
- namematch.fit_model.define_necessary_models(dr_file_list, output_dir, missing_mod_field=None, selection_model_name='basic_selection_model.pkl', match_model_name='basic_match_model.pkl')[source]
Determine the different models needed (using a sample) and define the characteristics of data that determine which model should handle it.
- NOTE: Right now, there is an assumption that the training universe
is the same between all models (i.e. basic and missingness)
- Parameters:
dr_file_list (list) – list of paths to all data row files
output_dir (str) – model output folder path
missing_mod_field (str or None) – field that could trigger need for separate model
- Returns:
mapping the name of a model (str) to a dict of the following information:
selection_model_name (str)
match_model_name (str)
type (str): one of “default” or “missingness”
actual_phat_universe (dict): maps a variable name to a value(?)
vars_to_exclude (str list)
match_thresh (float): threshold for match/nonmatch
- Return type:
dict
- namematch.fit_model.load_and_save_trained_model(trained_model_info_file, output_file)[source]
Load a set of pre-trained models and copy them to the current run’s output directory. Typically only used in incremental runs.
- Parameters:
trained_model_info_file (str) – path to a model yaml file, which has path/threshold/universe info
output_file (str) – path to output the current run’s model yaml file (for copying)
- Returns:
maps model name (e.g. basic or no_dob) to a trained model object
- Return type:
dict
- namematch.fit_model.get_match_train_eligible_flag(df, dr_train_eligible_conditions_dict, an_train_eligible_dict)[source]
Determine if a data-row is eligible for training (for match models), according to both all-names eligibility criteria and data-row eligibility criteria.
- Parameters:
df (pd.DataFrame) – portion of data-rows file, limited to labeled rows
dr_train_eligible_conditions_dict (dict) – contains data-row training eligibility criteria
an_train_eligible_dict (dict) – maps record_id to flag indicating record’s all-names based training eligibility
- Returns:
flag, 1 if data-row is training eligible (for match models)
- Return type:
pd.Series
- namematch.fit_model.add_threshold_dict(model_info, thresholds_dict)[source]
Add threshold information to the model_info dict, once it’s been determined.
- Parameters:
model_info (dict) – dict with information about how to fit the model
thresholds_dict (dict) – keys are model name (e.g. basic, no-dob), values are optimized thresholds
- Returns:
model dict, now with threshold info
- Return type:
dict
- namematch.fit_model.get_flipped0_potential_edges(phats_df, model_info, allow_clusters_w_multiple_unique_ids)[source]
If allowed, identify the set of labeled 0s with high phats so they can be treated as matches downstream.
- Parameters:
phats_df (pd.DataFrame) –
phat info for record pairs
record_id (_1, _2)
unique record identifiers
model_to_use
based on pair characteristics, which model to use (e.g. basic or no-dob)
covered_pair
did the pair make it through blocking
match_train_eligible
is the pair eligible for training (for match model)
exactmatch
is the pair an exact match on name/dob
label
whether the pair is a match or not
<phat_col>
predicted probability of match
model_info (dict) – dict with information about how to fit the model
allow_clusters_w_multiple_unique_ids (bool) – param controlling if 0s can be flipped to 1
- Returns:
same as phats_df, limited to the labeled-0 pairs whose predicted probabilities are high enough to flip
- Return type:
pd.DataFrame
namematch.predict
- class namematch.predict.Predict(params, data_rows_dir, model_info_file, output_dir, *args, **kwargs)[source]
Bases:
NamematchBase
- Parameters:
params (Parameters object) – contains parameter values
model_info_file (str) – path to the model info yaml file for a trained model
data_rows_dir (str) – path to the data-rows dir
output_dir (str) – path to the potential-links dir
- property output_files
- property dr_file_list
- main(**kw)[source]
Read in data-rows and predict (in parallel) for each unlabeled pair. Output the pairs above the threshold.
- get_potential_edges(dr_file, match_models, model_info, output_dir, params, **kw)[source]
Read in data rows in chunks and predict as needed. Write (append) the edges above the threshold to the appropriate file.
- Parameters:
dr_file (string) – path to data file to predict for
match_models (dict) – maps model name (e.g. basic or no-dob) to a fit match model object
model_info (dict) – contains information about threshold
output_dir (str) – directory to place potential links
params (Parameters obj) – contains parameter values (i.e. use_uncovered_phats)
- get_potential_edges_in_parallel(match_models, model_info, output_dir, params)[source]
Dispatch the worker threads that will predict for unlabeled pairs in parallel.
- Parameters:
match_models (dict) – maps model name (e.g. basic or no-dob) to a fit match model object
model_info (dict) – dict with information about how to fit the model
output_dir (str) – directory to place potential links
params (Parameters object) – contains parameter values
- classmethod predict(models, df, model_type, oob=False, all_cols=False, all_models=True, prob_match_train=None)[source]
Use the trained models to predict for pairs of records.
- Parameters:
models (dict) – maps model name (e.g. basic or no-dob) to a fit match model object
df (pd.DataFrame) – portion of the data-rows table, with a “model_to_use” column appended
model_type (str) – model type (e.g. selection or match)
oob (bool) – if True, use the out-of-bag predictions
all_cols (bool) – if True, keep all columns in the output df; not just the relevant ones
all_models (bool) – if True, predict for each row using all models, not just the “model to use”
prob_match_train (float) – share of data-rows that are labeled
namematch.cluster
- class namematch.cluster.Constraints[source]
Bases:
object
- property get_columns_used
- property is_valid_link
- property is_valid_cluster
- property apply_link_priority
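Below is a hedged sketch of a user-supplied constraints script passed via the `constraints` argument; the function names mirror the properties above, but the exact signatures, arguments, and return conventions shown are assumptions and should be checked against the constraints documentation.

```python
# constraints.py (illustrative; signatures are assumptions)
import pandas as pd

def get_columns_used():
    # Extra all-names columns needed for constraint checking (assumed format).
    return ["dob", "gender"]

def is_valid_link(record_pair):
    # Example rule: never link records with conflicting, non-missing gender.
    genders = record_pair["gender"].dropna().unique()
    return len(genders) <= 1

def is_valid_cluster(cluster):
    # Example rule: reject clusters spanning more than two distinct birth dates.
    return cluster["dob"].nunique() <= 2

def apply_link_priority(potential_links):
    # Try higher-probability edges first.
    return potential_links.sort_values("phat", ascending=False)
```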
- class namematch.cluster.Cluster(params, schema, must_links_file='must_links.parquet', potential_edges_dir='potential_links', flipped0_edges_file='flipped0_potential_links.csv', all_names_file='all_names.parquet', cluster_assignments='cluster_assignments.pkl', edges_to_cluster='edges_to_cluster.parquet', constraints: str | Constraints | None = None, *args, **kwargs)[source]
Bases:
NamematchBase
- Parameters:
params (Parameters object) – contains parameter values
schema (Schema object) – contains match schema info (files to match, variables to use, etc.)
constraints (str or Constraints object) – either a path to a python script defining constraint functions or a Constraints object
must_links_file (str) – path to the must-links file
potential_edges_dir (str) – path to the potential-links dir in the output/details folder
flipped0_edges_file (str) – path to the flipped-links file
all_names_file (str) – path to the all-names file
cluster_assignments (str) – path to the cluster-assignments file
- property output_files
- main(**kw)[source]
Read the record pairs with high probability of matching and connect them in a way that doesn’t violate any logic constraints to form clusters.
- auto_is_valid_edge(edges_df, uid_cols, allow_clusters_w_multiple_unique_ids, leven_thresh, eid_col=None)[source]
Check if two records would violate a unique id or existing id constraint.
- Parameters:
edges_df (pd.DataFrame) –
potential edges information
record_id_1
unique record identifier (for first in pair)
record_id_2
unique record identifier (for second in pair)
phat
predicted probability of a record pair being a match
original_order
original ordering 1-N (useful so gt is always on top of phat=1 cases)
uid_cols (list) – all-names column(s) with compare_type UniqueID
allow_clusters_w_multiple_unique_ids (bool) – True if a cluster can have multiple uid values
leven_thresh (int) – n character edits to allow between uids before they’re considered different
eid_col (str) – all-names column with compare_type ExistingID (None for non-incremental runs)
- Returns:
potential edges information, but limited to rows that pass the automated validity check
- Return type:
valid_edges_df (pd.DataFrame)
- auto_is_valid_cluster(cluster, uid_cols, allow_clusters_w_multiple_unique_ids, leven_thresh, eid_col=None)[source]
Check if a proposed cluster would violate a unique id or existing id constraint.
- Parameters:
cluster (pd.DataFrame) – all-names file (relevant columns only) records for the proposed cluster
uid_cols (list) – all-names column(s) with compare_type UniqueID
allow_clusters_w_multiple_unique_ids (bool) – True if a cluster can have multiple uid values
leven_thresh (int) – n character edits to allow between uids before they’re considered different
eid_col (str) – all-names column with compare_type ExistingID (None for non-incremental runs)
- Returns:
False if an automated constraint is violated
- Return type:
bool
- get_initial_clusters(must_links_df, an_df, eid_col, **kw)[source]
Use must links (ground truth and/or a previous run) to create the starting clusters.
- Parameters:
must_links_df (pd.DataFrame) –
record pairs that must be linked together no matter what
record_id_1
unique identifier for the first record in the pair
record_id_2
unique identifier for the second record in the pair
blockstring_1
blockstring for the first record in the pair
blockstring_2
blockstring for the second record in the pair
drop_from_nm_1
flag, 1 if the first record in the pair was not eligible for matching
drop_from_nm_2
flag, 1 if the second record in the pair was not eligible for matching
existing
flag, 1 if the pair is must-link because of ExistingID
an_df (pd.DataFrame) –
all-names file, with only the columns relevant for clustering
record_id
unique record identifier
<uid column(s)>
columns with compare_type UniqueID
<eid column(s)>
columns with compare_type ExistingID
<user-constraint column(s)>
(optional) columns mentioned in get_columns_used()
eid_col (str) – all-names column with compare_type ExistingID, or None
- Returns:
clusters (dict) – maps a cluster id to a list of record ids
cluster_assignments (dict) – maps a record_id to a cluster_id
original_cluster_ids (set) – cluster ids that are already in use (only for incremental runs)
- Return type:
tuple of (dict, dict, set)
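As a rough mental model (not the package's code), this step amounts to taking the connected components of the must-link graph. A minimal sketch, assuming networkx is available and ignoring the drop_from_nm flags and the ExistingID bookkeeping used in incremental runs:

```python
# Illustration only: starting clusters are essentially the connected
# components of the must-link graph. The real method also handles the
# drop_from_nm flags and ExistingID-based cluster ids for incremental runs.
import networkx as nx


def sketch_initial_clusters(must_links_df):
    g = nx.Graph()
    g.add_edges_from(
        must_links_df[["record_id_1", "record_id_2"]].itertuples(index=False))
    clusters, cluster_assignments = {}, {}
    for cluster_id, records in enumerate(nx.connected_components(g)):
        clusters[cluster_id] = sorted(records)
        for record_id in records:
            cluster_assignments[record_id] = cluster_id
    return clusters, cluster_assignments
```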
- get_potential_edges(potential_edges_files, flipped0_edges_file, gt_1s_df, cluster_logic, cluster_info, uid_cols, eid_col, **kw)[source]
Use the prediction (potential-links) files to build the list of edges that the constrained clustering algorithm should try to add.
- Parameters:
potential_edges_files (list) – paths to the potential links files
flipped0_edges_file (str) – path to the flipped0-links file
gt_1s_df (pd.DataFrame) – known y=1s; will be matched, pending the edge/cluster validity
cluster_logic (module) – user-defined constraint functions
cluster_info (pd.DataFrame) – all-names file, with only the columns relevant for clustering
uid_cols (list) – all-name columns with compare_type UniqueID
eid_col (str) – all-name column with compare_type ExistingID
- load_cluster_info(all_names_file, uid_cols, eid_col, cluster_logic, **kw)[source]
Read in the all_names information needed for cluster constraint checking. Columns defined in the config as compare type UniqueID or ExistingID will automatically be loaded (as strings, with missing values represented as NA). Other columns you wish to load should be defined in the user-defined get_columns_used() function (an example constraint module is sketched after this entry).
- Parameters:
all_names_file (str) – path to the all-names file
uid_cols (list) – all-name columns with compare_type UniqueID
eid_col (str) – all-name column with compare_type ExistingID
cluster_logic (module) – user-defined constraint functions
- Returns:
all-names file, with only the columns relevant for clustering
record_id
unique record identifier
<uid column(s)>
columns with compare_type UniqueID
<eid column(s)>
columns with compare_type ExistingID
<user-constraint column(s)>
(optional) columns mentioned in get_columns_used()
- Return type:
pd.DataFrame
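The constraints argument accepted by Cluster can point to a small Python module of user-defined constraint functions. The sketch below is illustrative only: get_columns_used() is the hook named in the documentation above, but the column-to-dtype return format shown and the is_valid_cluster() hook (name and signature) are assumptions that should be checked against the package's constraint documentation.

```python
# my_constraints.py -- illustrative sketch only. get_columns_used() is the
# hook named in the documentation above; the column-to-dtype return format
# and the is_valid_cluster() hook below are assumptions, not a guaranteed
# namematch API.

def get_columns_used():
    # Extra all-names columns (beyond the UniqueID/ExistingID columns that
    # are loaded automatically) needed for constraint checking.
    return {"dob": str, "gender": str}


def is_valid_cluster(cluster, new_edge=None):
    # Hypothetical cluster-level constraint: reject any proposed cluster
    # that spans more than two distinct dates of birth.
    return cluster["dob"].dropna().nunique() <= 2
```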
- cluster_potential_edges(clusters, cluster_assignments, original_cluster_ids, cluster_info, cluster_logic, uid_cols, eid_col, **kw)[source]
Form clusters by adding potential edges to the cluster graph in order of importance, skipping any edge that would cause a constraint violation (a simplified sketch of this loop follows this entry).
- Parameters:
clusters (dict) – maps a cluster id to a list of record ids – post initialization
cluster_assignments (dict) – maps a record_id to a cluster_id – post initialization
original_cluster_ids (set) – cluster ids that are already in use (only for incremental runs)
cluster_info (pd.DataFrame) –
all-names file, with only the columns relevant for clustering
record_id
unique record identifier
<uid column(s)>
columns with compare_type UniqueID
<eid column(s)>
columns with compare_type ExistingID
<user-constraint column(s)>
(optional) columns mentioned in get_columns_used()
potential_edges (deque) – each element is a dict version of a potential edge’s record
cluster_logic (module) – user-defined constraint functions
uid_cols (list) – all-name columns with compare_type UniqueID
eid_col (str) – all-name column with compare_type ExistingID
- Returns:
maps record_id to cluster_id
- Return type:
dict
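Putting the pieces above together, the clustering pass can be thought of as a greedy loop over the potential-edge queue. The sketch below is a simplification, not the actual implementation: it omits the automated UID/ExistingID checks, logging, and incremental-run bookkeeping, and is_valid_cluster here stands in for whatever combination of automated and user-defined checks applies.

```python
# Simplified sketch of the constrained greedy clustering pass: take edges
# in priority order and merge the two endpoint clusters unless a constraint
# check fails.
def sketch_cluster_edges(clusters, cluster_assignments, potential_edges,
                         cluster_info, is_valid_cluster):
    # Assumes every record already belongs to some (possibly singleton)
    # cluster and that cluster_info is indexed by record_id.
    while potential_edges:
        edge = potential_edges.popleft()      # highest-priority edge first
        c1 = cluster_assignments[edge["record_id_1"]]
        c2 = cluster_assignments[edge["record_id_2"]]
        if c1 == c2:
            continue                          # already in the same cluster
        merged_records = clusters[c1] + clusters[c2]
        if not is_valid_cluster(cluster_info.loc[merged_records]):
            continue                          # skip edges that cause violations
        clusters[c1] = merged_records         # accept the merge
        for record_id in clusters.pop(c2):
            cluster_assignments[record_id] = c1
    return cluster_assignments
```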
namematch.generate_output
- class namematch.generate_output.GenerateOutput(params, schema, all_names_file, cluster_assignments_file, an_output_file, output_dir, output_file_uuid=None, *args, **kwargs)[source]
Bases:
NamematchBase
- Parameters:
params (Parameters object) – contains parameter values
schema (Schema object) – contains match schema info (files to match, variables to use, etc.)
all_names_file (str) – path to the all-names file
cluster_assignments_file (str) – path to the cluster-assignments file
an_output_file (str) – path to the all-names-with-clusterid file
output_dir (str) – path to final output directory
- property output_files
- main(**kw)[source]
Read in the cluster assignments dictionary and use it to create the all-names-with-cluster-id file and the “with-cluster-id” versions of the input datasets.
- create_allnames_clusterid_file(all_names_file, cluster_assignments, cleaned_col_names, **kw)[source]
Create all-names-with-clusterid dataframe.
- Parameters:
all_names_file (str) – path to the all-names file
cluster_assignments (dict) – maps record_id to cluster_id
cleaned_col_names (list) – all-name columns used in cosine blocking
- Returns:
all-names-with-cluster-id
record_id
unique record identifier
file_type
either “new” or “existing”
<fields for matching>
both for the matching model and for constraint checking
blockstring
concatenated version of blocking columns (sep by ::)
drop_from_nm
flag, 1 if the record met any “to drop” criteria, 0 otherwise
cluster_id
unique person identifier, no missing values
- Return type:
pd.DataFrame
- output_clusterid_files(data_files, cluster_assignments, output_dir, output_file_uuid=None, **kw)[source]
For each input file, construct a matching output file that has the cluster_id column, and write it.
- Parameters:
data_files (list of DataFile objects) – contains info about each input file
cluster_assignments (dict) – maps record_id to cluster_id
output_dir (str) – the path that was supplied when the name match object was created
namematch.utils.utils
- namematch.utils.utils.setup_logging(log_params, log_filepath, output_temp_dir, filter_stats=False, logging_level='INFO')[source]
Setup logging configuration.
- Parameters:
log_params (dict) – contains info for logging setup
log_filepath (str) – path to store logs
- namematch.utils.utils.log_stat(human_desc, yaml_desc, value)[source]
Log a statistic in the log and in the stats yaml.
- Parameters:
human_desc (str) – human readable description of the stat (could be a phrase)
yaml_desc (str) – concise yaml-key compatible description of the stat
value (float or str) – value of the stat
- namematch.utils.utils.log_runtime_and_memory(method)[source]
Decorator that logs time to execute functions and records max memory usage in GB.
- Parameters:
method (function) – function to measure/log runtime and memory usage
- Returns:
value returned by the function being decorated
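Because it is an ordinary decorator, it can be applied to any long-running step. A minimal usage sketch follows; the decorated function is hypothetical, and within the package itself the decorator is typically applied to task methods.

```python
from namematch.utils.utils import log_runtime_and_memory


@log_runtime_and_memory
def expensive_step(df):
    # Hypothetical long-running step whose runtime and peak memory we want logged.
    return df.groupby("record_id").size()
```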
- namematch.utils.utils.load_yaml(yaml_file)[source]
Load a yaml file into a dictionary.
- Parameters:
yaml_file (str) – path to yaml file
- Returns:
dictionary version of input yaml file
- Return type:
dict
- namematch.utils.utils.dump_yaml(dict_to_write, yaml_file)[source]
Write a dictionary into a yaml file.
- Parameters:
dict_to_write (dict) – dict to write to yaml
yaml_file (str) – path to output yaml file
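These two helpers are simple inverses; a short usage sketch (file names are illustrative):

```python
from namematch.utils.utils import load_yaml, dump_yaml

# Round-trip a config: read it into a dict, tweak a value, write it back out.
params = load_yaml("config.yaml")
params["verbose"] = True
dump_yaml(params, "config_updated.yaml")
```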
- namematch.utils.utils.to_dict(obj)[source]
Convert an object (i.e. instance of a user-defined class) into a dictionary to make writing easier.
- Parameters:
obj (object) – class instance to convert to dict
- namematch.utils.utils.clean_nn_string(n)[source]
Remove suffixes such as JR, SR, and II, along with extra spaces, from near-neighbor (nn) name strings. The original string in the dataframe keeps its punctuation and suffixes.
- Parameters:
n (str) – raw name value
- Returns:
clean version of the input name
- Return type:
str
- namematch.utils.utils.build_blockstring(df, blocking_scheme, incl_ed_string=True)[source]
Create blockstrings (values for blocking separated by ::, such as JOHN::SMITH::1993-07-23) from all-names data.
- Parameters:
df (pd.DataFrame) –
all-names table
record_id
unique record identifier
file_type
either “new” or “existing”
<fields for matching>
both for the matching model and for constraint checking
drop_from_nm
flag, 1 if the record met any “to drop” criteria, 0 otherwise
blocking_scheme (dict) – contains info about fields to block on
incl_ed_string (bool) – True if the blockstring should end with the edit-distance string (e.g. dob)
- Returns:
blockstrings
- Return type:
pd.Series
- namematch.utils.utils.get_nn_string_from_blockstring(blockstring)[source]
Parse out the near-neighbor string (e.g. first-name and last-name) from a blockstring.
- Parameters:
blockstring (str) – string with info for blocking (e.g. JOHN::SMITH::1993-07-23)
- Returns:
near-neighbor string (e.g. JOHN::SMITH)
- Return type:
str
- namematch.utils.utils.get_ed_string_from_blockstring(blockstring)[source]
Parse out the edit-distance string (e.g. dob) from a blockstring.
- Parameters:
blockstring (str) – string with info for blocking (e.g. JOHN::SMITH::1993-07-23)
- Returns:
edit-distance string (e.g. 1993-07-23)
- Return type:
str
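Using the example blockstring from the docstrings above, the two parsers pull out the near-neighbor and edit-distance portions:

```python
from namematch.utils.utils import (
    get_nn_string_from_blockstring,
    get_ed_string_from_blockstring,
)

bs = "JOHN::SMITH::1993-07-23"
get_nn_string_from_blockstring(bs)  # expected: "JOHN::SMITH"
get_ed_string_from_blockstring(bs)  # expected: "1993-07-23"
```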
- namematch.utils.utils.get_endpoints(n, num_chunks)[source]
Divide a number into some number of chunks/intervals.
- Parameters:
n (int) – number to divide into chunks/intervals
num_chunks (int) – number of chunks/intervals to create
- Returns:
list of start and end points to cover entire range
- Return type:
list of int tuples
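The exact endpoint convention (inclusive vs. exclusive ends, handling of remainders) is not spelled out here, but conceptually the helper behaves like the hedged re-implementation below.

```python
# Illustrative re-implementation only; the library's convention for the
# interval boundaries may differ slightly.
def example_endpoints(n, num_chunks):
    size, remainder = divmod(n, num_chunks)
    endpoints, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < remainder else 0)
        endpoints.append((start, end))
        start = end
    return endpoints

example_endpoints(10, 3)  # [(0, 4), (4, 7), (7, 10)]
```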
- namematch.utils.utils.load_sample(csv_path, pct, cols=None)[source]
Load a random sample of a csv into pandas.
- Parameters:
csv_path (str) – path to csv file
pct (float) – what percent of the file to randomly read
cols (list) – columns to load
- Returns:
random subset of the input csv
- Return type:
pd.DataFrame
- namematch.utils.utils.load_csv_list(df_file_list, cols=None, conditions_dict={}, sample=1)[source]
Read a list of .csv files into a single pd.DataFrame.
- Parameters:
df_file_list (list of str) – list of .csv files to read
cols (list) – columns to keep in the dataframe
conditions_dict (dict) – conditions for row filtering
sample (float) – share of rows to randomly sample from the final dataframe
- Returns:
filtered sampled dataframe read in from the .csv files
- Return type:
pd.DataFrame
- namematch.utils.utils.load_parquet(df_file, cols=None, conditions_dict={})[source]
Read a .parquet file into a pd.DataFrame.
- Parameters:
df_file (str) – .parquet file to read
cols (list) – columns to keep in the dataframe
conditions_dict (dict) – conditions for row filtering
- Returns:
filtered dataframe read in from the .parquet file
- Return type:
pd.DataFrame
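A hedged usage sketch: the file path is illustrative, and conditions_dict is assumed to map a column name to the value rows must equal in order to be kept.

```python
from namematch.utils.utils import load_parquet

# Load only the columns needed, keeping rows where drop_from_nm == 0
# (assumed equality-filter semantics for conditions_dict).
an_df = load_parquet(
    "output/details/all_names.parquet",
    cols=["record_id", "drop_from_nm"],
    conditions_dict={"drop_from_nm": 0},
)
```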
- namematch.utils.utils.load_parquet_list(df_file_list, cols=None, conditions_dict={}, sample=1)[source]
Read a list of .parquet files into a single pd.DataFrame.
- Parameters:
df_file_list (list of str) – list of .parquet files to read
cols (list) – columns to keep in the dataframe
conditions_dict (dict) – conditions for row filtering
sample (float) – share of rows to randomly sample from the final dataframe
- Returns:
filtered sampled dataframe read in from the .parquet files
- Return type:
pd.DataFrame
- namematch.utils.utils.determine_model_to_use(dr_df, model_info, verbose=False)[source]
Assign a model to each data row based on which fields are available.
- Parameters:
dr_df (pd.DataFrame) –
data rows
record_id_1
unique identifier for the first record in the pair
record_id_2
unique identifier for the second record in the pair
<distance metric fields>
distance metrics between the two records’ matching fields
label
flag, “1” if the records are a match, “0” if not, “” if unknown
model_info (dict) – information about models and their universes
verbose (bool) – flag controlling logging statement (set according to which function calls this one)
- Returns:
string indicating which model to use for a given record pair
- Return type:
pd.Series
- namematch.utils.utils.load_models(model_info_file, selection=False)[source]
Load pre-trained models (selection and match, as available)
- Parameters:
model_info_file (str) – path to original model config
selection (bool) – if True, try to load a corresponding selection model
- Returns:
dict – maps model name (e.g. basic or no-dob) to a fit model object
dict – information about how to fit the model
- Return type:
tuple of (dict, dict)
namematch.generate_report
- class namematch.generate_report.IgnoreBlackWarning(name='')[source]
Bases:
Filter
Initialize a filter with the name of the logger which, together with its children, will have its events allowed through the filter. If no name is specified, allow every event.
- class namematch.generate_report.GenerateReport(params, schema, report_file, *args, **kwargs)[source]
Bases:
NamematchBase
- Parameters:
params (Parameters object) – contains parameter values
schema (Schema object) – contains match schema info (files to match, variables to use, etc.)
report_file (str) – full path of the report HTML file
- property output_files