API Reference

namematch.namematcher

class namematch.namematcher.NameMatcher(config: dict | None = None, default_params: dict | None = None, og_blocking_index_file: str = 'None', trained_model_info_file: str = 'None', nm_info_file: str = 'nm_info.yaml', log_file_name: str | None = None, logging_params_file: str | None = None, output_dir: str = 'output', output_temp_dir: str | None = None, all_names_file: str = 'all_names.parquet', must_links: str = 'must_links.parquet', blocking_index_bin_file: str = 'blocking_index.bin', candidate_pairs_file: str = 'candidate_pairs.parquet', data_rows_dir: str = 'data_rows', selection_model_name: str = 'basic_selection_model.pkl', match_model_name: str = 'basic_match_model.pkl', flipped0_file: str = 'flipped0_potential_links.csv', model_dir: str = 'model', model_info_file: str = 'model.yaml', potential_edges_dir: str = 'potential_links', cluster_assignments: str = 'cluster_assignments.pkl', edges_to_cluster: str = 'edges_to_cluster.parquet', constraints: str | Constraints | None = None, an_output_file: str = 'all_names_with_clusterid.csv', report_file: str = 'matching_report.html', enable_lprof: bool = False, logging_level: str = 'INFO', params=None, schema=None)[source]

Bases: object

Main interface to run all the steps in namematch

property process_input_data
property block
property generate_data_rows
property fit_model
property predict
property cluster
property generate_output
property generate_report
property all_tasks
property nm_metadata

Namematch state including all the necessary attributes to recreate the NameMatcher object

run(force=False, write_params_schema_file=True, write_stats_file=True)[source]

Main method to kick off the namematch process

Parameters:
  • force (bool) – whether to force all tasks to re-run, even those already completed

  • write_params_schema_file (bool) – whether to write params and schema to yaml

  • write_stats_file (bool) – whether to write the nm_info file

classmethod load_namematcher(nm_info_file_path, new_nm_info_file=None, **kwargs)[source]

Load a NameMatcher instance from an nm_info_file. This classmethod lets a run pick up where it left off last time, based on the information in the nm_info_file: it creates a NameMatcher instance and recovers all of the attributes, as well as the stats_dict for tasks that have already run.

Parameters:

nm_info_file_path (str) – path to the nm_info.yaml file

Returns:

NameMatcher instance

Return type:

namematch.namematcher.NameMatcher
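
Example – a minimal usage sketch. The config dictionary is assumed to follow the standard Name Match config format and to have been written by the user; the file paths are placeholders.

    import yaml
    from namematch.namematcher import NameMatcher

    # Load a user-written Name Match config (placeholder path).
    with open("config.yaml") as f:
        config = yaml.safe_load(f)

    # Run every task, from input processing through report generation.
    nm = NameMatcher(config=config, output_dir="output")
    nm.run()

    # Later, resume from a previous run using the nm_info file it wrote
    # (path assumed; use the nm_info_file produced by that run).
    nm = NameMatcher.load_namematcher("output/nm_info.yaml")
    nm.run(force=False)  # re-runs only the tasks that have not completed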

namematch.data_structures.parameters

class namematch.data_structures.parameters.Parameters(validate_param_dict)[source]

Bases: object

Class that houses important matching parameters. Handles validation of the config file.

static check_integrity(defaults, param, param_value)[source]

Ensure that parameters are of the appropriate type.

Parameters:
  • defaults (dict) – dictionary with default parameter values

  • param (str) – parameter name (key)

  • param_value – value of the given key (parameter name)

classmethod init(config: dict, defaults: dict)[source]

Create a Parameters instance.

Parameters:
  • config (dict) – dictionary with match parameter values

  • defaults (dict) – dictionary with default params

Returns:

instance of the Parameters class

Return type:

namematch.data_structures.parameters.Parameters

classmethod load(filepath)[source]

Load a Parameters instance.

Parameters:

filepath (str) – path to a yaml version of a Parameters instance

Returns:

instance of the Parameters class

Return type:

namematch.data_structures.parameters.Parameters

classmethod load_from_dict(param_dict)[source]
check_for_required_variables(variables)[source]

Validate that the config includes required variables.

validate_exactmatch_variables(variables)[source]

Validate the exact_match_variables and negate_exact_match_variables parameters.

validate_blocking_scheme(variables)[source]

Validate that the blocking scheme is in the correct format and provides the minimum number of blocking variables per blocking type (cosine_distance, edit_distance, absvalue_distance).

get_blocking_variables()[source]

Get list of blocking variable nicknames.

Returns:

list of variable nicknames (all-names columns) to use for blocking

validate(variables)[source]

Validate several components of the config file.

write(output_file)[source]

Write the Parameters to a yaml file.

Parameters:

output_file (str) – path to write parameter dictionary

copy()[source]

Create a deep copy of a Parameters object.

stage_params_lookup()[source]
get_stage_params(stage)[source]
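
Example – a sketch of the typical Parameters lifecycle. config_dict and defaults_dict are assumed to already exist (the user config and the package defaults, both plain dictionaries); the yaml path is illustrative.

    from namematch.data_structures.parameters import Parameters

    # Validate the user config against the defaults and build a Parameters object.
    params = Parameters.init(config=config_dict, defaults=defaults_dict)

    # Persist the processed parameters, then reload them in a later session.
    params.write("output/details/parameters.yaml")
    params = Parameters.load("output/details/parameters.yaml")

    # Nicknames of the all-names columns used for blocking.
    blocking_vars = params.get_blocking_variables()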

namematch.data_structures.schema

class namematch.data_structures.schema.Schema(data_files, variables)[source]

Bases: object

Class that houses the most essential instructions for how to complete the match: what data files to match, and which variables to use to do so.

classmethod init(config, params)[source]

Create and validate a DataFileList instance and a VariableList instance.

Parameters:
  • config (dict) – dictionary with match parameter values

  • params (dict) – dictionary with processed match parameter values

Returns:

instance of the Schema class

Return type:

namematch.data_structures.schema.Schema

classmethod load(filepath)[source]

Load a Schema instance.

Parameters:

filepath (str) – path to a yaml version of a Schema instance

Returns:

instance of the Schema class

Return type:

namematch.data_structures.schema.Schema

classmethod load_from_dict(schema_dict)[source]
write(output_file)[source]

Write the Schema to a yaml file.

Parameters:

output_file (str) – path to write schema dictionary
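
Example – a sketch of creating and persisting a Schema. The config dict and processed params are assumed to exist already (see Parameters above); the yaml path is illustrative.

    from namematch.data_structures.schema import Schema

    # Build (and validate) the data-file list and variable list from the config.
    schema = Schema.init(config, params)

    # Persist the schema, then reload it in a later session.
    schema.write("output/details/schema.yaml")
    schema = Schema.load("output/details/schema.yaml")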

namematch.data_structures.data_file

class namematch.data_structures.data_file.DataFile(validated_data_file_dict)[source]

Bases: object

Parent class for NewDataFile and ExistingDataFile, which house details about the data files input for matching.

classmethod load(data_file_dict)[source]

Load a DataFile instance (either a NewDataFile or an ExistingDataFile).

Parameters:

data_file_dict (dict) – dictionary-version of a DataFile object

Returns:

instance of the DataFile class

validate_existance()[source]

Validate that an input file path exists.

validate_record_id_col()[source]

Validate that the record_id column exists and meets uniqueness criteria.

copy()[source]

Create a deep copy of a DataFile object.

class namematch.data_structures.data_file.NewDataFile(validated_data_file_dict)[source]

Bases: DataFile

classmethod build(nickname, info)[source]

Create a NewDataFile instance.

Parameters:
  • nickname (str) – the data file’s nickname

  • info (dict) – info about a data file definition from user-input config

Returns:

instance of the NewDataFile class

Return type:

namematch.data_structures.data_file.NewDataFile

class namematch.data_structures.data_file.ExistingDataFile(validated_data_file_dict)[source]

Bases: DataFile

classmethod build(nickname, info)[source]

Create an ExistingDataFile instance.

Parameters:
  • nickname (str) – the data file’s nickname

  • info (dict) – info about a data file definition from user-input config

Returns:

instance of the ExistingDataFile class

Return type:

namematch.data_structures.data_file.ExistingDataFile

class namematch.data_structures.data_file.DataFileList(data_files_dict)[source]

Bases: object

Class that houses a list of DataFile objects (either NewDataFiles or ExistingDataFiles).

classmethod build(data_files_dict, existing_data_files_dict)[source]

Create a DataFileList instance.

Parameters:
  • data_files_dict (dict) – dictionary with “new data file” info from user-input config

  • existing_data_files_dict (dict) – dictionary with “existing data file” info from user-input config

Returns:

instance of the DataFileList class

Return type:

namematch.data_structures.data_file.DataFileList

classmethod load(data_files_list_dict)[source]

Load a DataFileList instance.

Parameters:

data_files_list_dict (dict) – dictionary-version of a DataFileList object

Returns:

instance of the DataFileList class

Return type:

namematch.data_structures.data_file.DataFileList

get_all_nicknames()[source]

Return a list of all of the DataFile nicknames in the DataFileList.

Returns:

list of strings

validate()[source]

Validate the DataFileList by validating the list overall and then validating each individual DataFile.

validate_names()[source]

Validate that the DataFiles in the DataFileList all have unique nicknames and that the output file stems have unique cluster types.

write(output_file)[source]

Write the DataFileList to a yaml file.

Parameters:

output_file (str) – path to write data file list dictionary

copy()[source]

Create a deep copy of a DataFileList object.

get_all_data_files()[source]

Retrieve list of all DataFile objects, regardless of New or Existing.

Returns:

list of DataFile objects
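
Example – a sketch of building a DataFileList straight from the user config. The config section names ("data_files", "existing_data_files") are assumed here for illustration.

    from namematch.data_structures.data_file import DataFileList

    data_files = DataFileList.build(
        config.get("data_files", {}),
        config.get("existing_data_files", {}),
    )
    data_files.validate()                  # unique nicknames, valid paths, record_id checks
    print(data_files.get_all_nicknames())  # nicknames of every DataFile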

namematch.data_structures.variable

class namematch.data_structures.variable.Variable(validated_variable_dict)[source]

Bases: object

classmethod build(variable_dict, params)[source]

Create a Variable instance.

Parameters:
  • variable_dict (dict) – info about a variable definition from user-input config

  • params (dict) – dictionary with processed match parameter values

Returns:

instance of the Variable class

Return type:

namematch.data_structures.variable.Variable

validate_col_parameters(data_files)[source]

Validate that each data file has a corresponding “_col” parameter in each variable definition.

Parameters:

data_files (namematch.data_structures.data_file.DataFileList) – info about what input files are being matched

get_columns_to_read(file_nickname)[source]

Get the name(s) of the column(s) from the input file that need to be read in order to create the current all-names column.

Parameters:

file_nickname (str) – nickname of input file being searched

Returns:

list of column names

get_an_columns()[source]

Get the name(s) of the current all-names column(s).

Returns:

list of column names

copy()[source]

Create a deep copy of a Variable object.

class namematch.data_structures.variable.VariableList(variable_list)[source]

Bases: object

Class that houses a list of Variable objects.

classmethod build(variable_dict_list, params)[source]

Create a VariableList instance.

Parameters:
  • variable_dict_list (dict) – dictionary with variable info from user-input config

  • params (dict) – dictionary with processed match parameter values

Returns:

instance of the VariableList class

Return type:

namematch.data_structures.variable.VariableList

classmethod load(variables_list_dict)[source]

Load a VariableList instance.

Parameters:

variables_list_dict (dict) – dictionary version of a VariableList instance

Returns:

instance of the VariableList class

Return type:

namematch.data_structures.variable.VariableList

validate_col_parameters(data_files)[source]

Validate that the “_col” variables referenced in the config’s variable definitions actually exist in the input datasets.

Parameters:

data_files (namematch.data_structures.data_file.DataFileList) – info about what input files are being matched

validate_variable_names()[source]

Validate that the Variables in the VariableList all have unique nicknames.

validate_type_counts(incremental)[source]

Validate that there is exactly one variable with compare type UniqueID. If incremental, validate that there is exactly one variable with compare type ExistingID.

Parameters:

incremental (bool) – True if the config file provides “existing” data files

validate(data_files)[source]

Validate several components of the variables defined in the config.

Parameters:

data_files (namematch.data_structures.data_file.DataFileList) – info about what input files are being matched

get_variables_where(attr, attr_value, equality_type='equals', return_type='name')[source]

Select variables that meet a certain condition (e.g. compare_type == ‘Category’).

Parameters:
  • attr (str) – variable feature to condition on

  • attr_value (str) – acceptable values for the variable feature

  • equality_type (str) – check conditions using either “equals” or “in”

  • return_type (str) – either “name” or “ix”, determining what return type to use

Returns:

list of variable nicknames (all-names columns) or corresponding all-names column indices

get_names()[source]

Get list of variable nicknames.

Returns:

list of variable nicknames

get_columns_to_read(data_file)[source]

Get the name(s) of the column(s) from the input file that need to be read in order to create the all-names file.

Parameters:

data_file (DataFile object) – contains info about a given input file

Returns:

list of column names

get_an_column_names()[source]

Get the final list of all-names columns, including internally created columns like file_type and drop_from_nm.

Returns:

list of all-names columns

write(output_file)[source]

Write the VariableList to a yaml file.

Parameters:

output_file (str) – path to write variable list dictionary

copy()[source]

Create a deep copy of a VariableList object.
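
Example – a sketch of building a VariableList and querying it. The "variables" config key is assumed for illustration; params and data_files are the objects built above.

    from namematch.data_structures.variable import VariableList

    variables = VariableList.build(config["variables"], params)
    variables.validate(data_files)

    # Nicknames of all variables whose compare_type is Category.
    category_vars = variables.get_variables_where(
        attr="compare_type", attr_value="Category"
    )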

namematch.process_input_data

class namematch.process_input_data.ProcessInputData(params, schema, all_names_file='all_names.parquet', *args, **kwargs)[source]

Bases: NamematchBase

property output_files
main(**kw)[source]

Follow the instructions in the schema and params objects to build the all-names file from the raw input file(s).

Parameters:
  • params (Parameters object) – contains parameter values

  • schema (Schema object) – contains match schema info (files to match, variables to use, etc.)

  • all_names_file (str) – path to the all-names file

process_geo_column(df, variable)[source]

Take dataframe of geographic data (either in “lat,lon” format or in “lat”, “lon” format) and ensure it has just one column.

Parameters:
  • df (pd.DataFrame) – df of address input data (columns are strings)

  • variable (Variable object) – contains naming info for new geo column

Returns:

DataFrame of clean geographic information for all_names file

Return type:

pd.DataFrame

parse_address(address)[source]

Parse an address string into distinct parts.

Parameters:

address (str) – string of full address (e.g. 54 East 18th Rd.)

Returns:

(address number, street name, street suffix)

Return type:

tuple
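
Example – a sketch of the address parser, assuming pid is an already-constructed ProcessInputData(params, schema) instance.

    # Split a free-text address into its documented components.
    number, street_name, street_suffix = pid.parse_address("54 East 18th Rd.")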

process_address_column(df, logger=None)[source]

Take dataframe of address data (either in “123 Main St.” format or “123”, “Main”, “St.” format, order matters) and parse as needed to produce three clean columns: street number, street name, and street type.

Parameters:

df (pd.DataFrame) – df of address input data

Returns:

Dataframe of clean address information for all_names file

Return type:

pd.DataFrame

process_check(s, variable)[source]

Check the validity of the values in a given all-names column (according to the data type and config instructions) and set the series name correctly.

Parameters:
  • s (pd.Series) – series to process (will be an all-names column)

  • variable (Variable object) – contains info on how to validate data in series

Returns:

Processed series

Return type:

pd.Series

process_data(df, variables, data_file, params)[source]

Read in part of an input file and process it according to the config in order to create part of the all-names file.

Parameters:
  • df (pd.DataFrame) – chunk of an input data file

  • variables (VariableList object) – contains info about the fields for matching (from config)

  • data_file (DataFile object) – contains info about the input data set

  • params (dict) – dictionary of param values

Returns:

a chunk of the all-names table (one row per input record)

record_id

unique record identifier

file_type

either “new” or “existing”

<fields for matching>

both for the matching model and for constraint checking

<raw name fields>

pre-cleaning version of first and last name

blockstring

concatenated version of blocking columns (sep by ::)

drop_from_nm

flag, 1 if met any “to drop” criteria 0 otherwise

Return type:

pd.DataFrame

namematch.process_input_data.process_set_missing(s, set_missing_list)[source]

Set values in a series to missing as needed.

Parameters:
  • s (pd.Series) – strings to process

  • set_missing_list (list) – list of strings that are disallowed

Returns:

Processed series

Return type:

pd.Series

namematch.process_input_data.process_drop(s, drop_list)[source]

Get the records in a series that have invalid values.

Parameters:
  • s (pd.Series) – series being processed

  • drop_list (list of str) – invalid values

Returns:

Indices of records that are not valid

Return type:

list
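
Example – a sketch of the two record-level cleaning helpers. The placeholder values ("UNKNOWN", "FIRM") stand in for whatever the config lists as missing or invalid.

    import pandas as pd
    from namematch.process_input_data import process_set_missing, process_drop

    names = pd.Series(["JOHN", "UNKNOWN", "FIRM", "JANE"])

    # Disallowed values are blanked out, but the records are kept.
    cleaned = process_set_missing(names, set_missing_list=["UNKNOWN"])

    # Invalid values flag the whole record; the offending indices are returned.
    drop_ix = process_drop(names, drop_list=["FIRM"])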

namematch.process_input_data.process_auto_drops(an, existing_drop_list, drop_logic)[source]

Get the records in all-names that have invalid values due to combination of multiple columns (based on logic in the private config).

Parameters:
  • an (pd.DataFrame) – all-names chunk being processed

  • existing_drop_list (list of str) – records already known to be invalid

  • drop_logic (list of dicts) – logic for what makes a record invalid

Returns:

Indices of records that are not valid

Return type:

list

namematch.block

class namematch.block.Block(params, schema, all_names_file='all_names.parquet', must_links_file='must_links.parquet', candidate_pairs_file='candidate_pairs.parquet', blocking_index_bin_file='blocking_index.bin', og_blocking_index_file='None', *args, **kwargs)[source]

Bases: NamematchBase

Parameters:
  • params (Parameters object) – contains matching parameter values

  • schema (Schema object) – contains match schema info (files to match, variables to use, etc.)

  • all_names_file (str) – path to the all-names file

  • must_links_file (str) – path to the must-links file

  • blocking_index_bin_file – name of blocking index file

  • og_blocking_index_file (str) – path to a pre-built nmslib index (optional, if doesn’t exist then None)

  • candidate_pairs_file (str) – path to the candidate-pairs file

property output_files
main(**kw)[source]

Generate the candidate-pairs list using the blocking scheme outlined in the config.

split_last_names(df, last_name_column, blocking_scheme, **kw)[source]

Expand the processed all-names file to handle double last names (e.g. SAM SMITH-BROWN becomes SAM SMITH and SAM BROWN).

Parameters:
  • df – all-names table, relevant columns only (where drop_from_nm == 0)

  • last_name_column (str) – clean last name column

  • blocking_scheme (dict) – dictionary with info on how to do blocking

Returns:

more rows than input all names, plus orig_last_name and orig_record columns

Return type:

pd.DataFrame

convert_all_names_to_blockstring_info(an, absval_col, params, **kw)[source]

Create a table with information about blockstrings. If the split_names parameter is True, then this function expands double last names to create two new “records” (e.g. SAM SMITH-BROWN becomes SAM SMITH and SAM BROWN).

Parameters:
  • an (pd.DataFrame) –

    all-names table, relevant columns only (where drop_from_nm == 0)

    record_id

    unique record identifier

    blockstring

    concatenation of all the blocking variables (sep by ::)

    file_type

    either “new” or “existing”

    drop_from_nm

    flag, 1 if met any “to drop” criteria 0 otherwise

    <nn-blocking column(s)>

    variables for near-neighbor blocking

    <ed-blocking column>

    variable for edit-distance blocking

    <av-blocking column>

    (optional) variable for abs-value blocking

    nn_string

    concatenated version of nn-blocking columns (sep by ::)

    ed_string

    copy of ed-blocking column

    absval_string

    copy of abs-value-blocking column

  • absval_col (str) – column for absolute-value blocking

  • params (Parameter object) – contains matching parameters

Returns:

tuple containing:

  • nn_string_info (pd.DataFrame): table with one row per nn_string (or expanded nn_string)

    nn_string

    concatenated version of nn-blocking columns (sep by ::)

    commonness_penalty

    float indicating how common the last name is

    n_new

    number of times this nn_string appears in a “new” record

    n_existing

    number of times this nn_string appears in an “existing” record

    n_total

    number of times this nn_string appears in any record

  • nn_string_expanded_df (pd.DataFrame): table with one row per blockstring (or expanded blockstring)

    nn_string

    concatenated version of nn-blocking columns (sep by ::)

    nn_string_full

    (optional) if split_names is True, this is the full (un-split) nn_string

    ed_string

    copy of ed-blocking column

    absval_string

    copy of abs-value-blocking column

Return type:

tuple

get_query_strings(nn_string_info, blocking_scheme)[source]

Filter to nn_strings that appear in the new data – these are the only strings for which we need near-neighbors. If incremental is False, this filtering step does nothing.

Parameters:
  • nn_string_info (pd.DataFrame) –

    table with one row per nn_string (or expanded nn_string)

    nn_string

    concatenated version of nn-blocking columns (sep by ::)

    commonness_penalty

    float indicating how common the last name is

    n_new

    number of times this nn_string appears in a “new” record

    n_existing

    number of times this nn_string appears in an “existing” record

    n_total

    number of times this nn_string appears in any record

  • blocking_scheme (dict) – dictionary with info on how to do blocking

Returns:

tuple containing:

  • nn_string_info_to_query (pd.DataFrame): nn_string_info, subset to nn_strings where n_new > 0

  • nn_strings_to_query (list): nn_strings that appear at least once in a “new” record

  • shingles_to_query (scipy.sparse.csr_matrix): sparse weighted shingles matrix for the nn_strings that appear in a new record

Return type:

tuple

generate_shingles_matrix(nn_strings, alpha, power, matrix_type, verbose=True, **kw)[source]

Return a weighted sparse matrix of 2-shingles

Parameters:
  • alpha (float) – weight of LAST relative to FIRST

  • power (float) – parameter controlling the impact of name length on cosine distance

  • matrix_type (str) – description of matrix being built (for logging)

  • verbose (bool) – True if status messages desired

Returns:

Weighted sparse 2-shingles matrix

Return type:

scipy.sparse.csr_matrix

load_main_index(index_file, **kw)[source]

Load the main index, which is reusable over time as data is added incrementally.

Parameters:

index_file (str) – path to stored index

Returns:

nmslib index object

Return type:

nmslib.FloatIndex

generate_index(nn_strings, num_workers, M, efC, post, alpha, power, print_progress=True, **kw)[source]

Build an nmslib index based on a list of nn_strings and a set of parameters.

Parameters:
  • nn_strings (list) – strings of the form ‘FIRST::LAST’ to shingle and put in matrix (rows)

  • num_workers (int) – number of threads nmslib should use when parallelizing

  • M – nmslib parameters

  • efC – nmslib parameters

  • post – nmslib parameters

  • alpha (float) – weight of last-name relative to first-name

  • power (float) – parameter controlling the impact of name length on cosine distance

  • print_progress (bool) – controls verbosity of index creation

Returns:

nmslib index object

Return type:

nmslib.FloatIndex

get_indices(params, all_nn_strings, og_blocking_index_file, **kw)[source]

Wrapper function coordinating the creation and/or loading of the nmslib indices.

Parameters:
  • params (Parameters object) – contains matching parameter values

  • all_nn_strings – list of all unique nn_strings in the data (expanded if split_names is True)

  • og_blocking_index_file – path to a pre-built nmslib index (optional, if doesn’t exist then None)

Returns:

tuple containing:

  • main_index (nmslib.FloatIndex): the main nmslib index

  • main_index_nn_strings (list): nn_strings that are in the main nmslib index

  • second_index (nmslib.FloatIndex): the secondary nmslib index for querying new nn_strings during incremental runs (often None)

  • second_index_nn_strings (list): nn_strings that are in the secondary nmslib index (often None)

Return type:

tuple

generate_candidate_pairs(nn_strings_to_query, shingles_to_query, nn_string_info, nn_string_expanded_df, main_index, main_index_nn_strings, second_index, second_index_nn_strings, batch_size, **kw)[source]

Wrapper function for querying the nmslib index (or indices) and getting non-matching candidate pairs.

Parameters:
  • nn_strings_to_query (list) – nn_strings in new data – those that need near neighbors

  • shingles_to_query (csr_matrix) – shingles matrix for nn_strings_to_query

  • nn_string_info (pd.DataFrame) – table with one row per nn_string (or expanded nn_string)

  • nn_string_expanded_df (pd.DataFrame) – maps a nn_string to a ed_string and absval_string

  • main_index (nmslib index) – the main nmslib index for querying

  • main_index_nn_strings (list) – nn_strings in main_index

  • second_index (nmslib index) – the secondary nmslib index, for some incremental runs

  • second_index_nn_strings (list) – nn_strings in second_index

  • batch_size (int) – batch size; defaults to 10000 and can be modified in the config.yaml file

Returns:

candidate-pairs list, before adding in uncovered pairs

blockstring_1

concatenated version of blocking columns for first element in pair (sep by ::)

blockstring_2

concatenated version of blocking columns for second element in pair (sep by ::)

cos_dist

approximate cosine distance between two nn_strings (nmslib)

edit_dist

number of character edits between ed-strings

covered_pair

flag; 1 for pairs that made it through blocking, 0 otherwise; all 1s here

Return type:

pd.DataFrame

compute_cosine_sim(blockstrings_in_pairs, pairs_df, shingles_matrix, **kw)[source]

Fast cosine similarity computation using the shingles matrix.

Parameters:
  • blockstrings_in_pairs (list) – used to get index of different strings in shingles_matrix

  • pairs_df (pd.DataFrame) –

    blockstrings you want cosine distance between

    blockstring_1

    blockstring for the first record in the pair

    blockstring_2

    blockstring for the second record in the pair

    covered_pair

    flag, 1 if covered 0 otherwise

    nn_strings_1

    nn_string for the first record in the pair

    nn_strings_2

    nn_string for the second record in the pair

    both_nn_strings

    nn_string_1 + ‘ ‘ + nn_string_2

Returns:

weighted shingles matrix

Return type:

shingles_matrix (csr_matrix)

evaluate_blocking(cp_df, tp_df, blocking_scheme, **kw)[source]

The evaluate_blocking function computes the pair completeness metrics to determine how successful blocking was at minimizing comparisons and maximizing true positives (i.e. generating a candidate pair between records that are actually matches).

Parameters:
  • cp_df (pd.DataFrame) – candidate pairs df

  • tp_df (pd.DataFrame) – true pairs df (blockstring_1, blockstring_2)

  • blocking_scheme (dict) – dictionary with info on how to do blocking

Returns:

portion of candidate-pairs dataframe where covered == 0

Return type:

pd.DataFrame

add_uncovered_pairs(candidate_pairs_df, uncovered_pairs_df)[source]

Add the uncovered pairs to the candidate pairs dataframe so that all of the known pairs are in the candidate pairs list.

Parameters:
  • candidate_pairs_df (pd.DataFrame) – candidate pairs file produced by blocking

  • uncovered_pairs_df (pd.DataFrame) – uncovered pairs produced by evaluating blocking

Returns:

candidate-pairs file

blockstring_1

concatenated version of blocking columns for first element in pair (sep by ::)

blockstring_2

concatenated version of blocking columns for second element in pair (sep by ::)

cos_dist

approximate cosine distance between two nn_strings (nmslib)

edit_dist

number of character edits between ed-strings

covered_pair

flag; 1 for pairs that made it through blocking, 0 otherwise

Return type:

pd.DataFrame

apply_blocking_filter(df, thresholds, nn_string_expanded_df, nns_match=False)[source]

Compare the similarity of names and DOBs to see whether a pair of records is likely to be a match.

Parameters:
  • df (pd.DataFrame) –

    holds similarity and commonness info about pairs of names

    nn_string_1

    concatenated version of nn-blocking columns for first element in pair (sep by ::)

    nn_string_2

    concatenated version of nn-blocking columns for second element in pair (sep by ::)

    cos_dist

    approximate cosine distance between two nn_strings (nmslib)

    commonness_penalty_1

    penalty for last-name commonness for first element in pair

    commonness_penalty_2

    penalty for last-name commonness for second element in pair

  • thresholds (dict) – information about what blocking distances are allowed

  • nn_string_expanded_df (pd.DataFrame) – maps a nn_string to a ed_string and absval_string

  • nns_match (bool) – True if this function is called by get_exact_match_candidate_pairs

Returns:

chunk of the candidate-pairs list

blockstring_1

concatenated version of blocking columns for first element in pair (sep by ::)

blockstring_2

concatenated version of blocking columns for second element in pair (sep by ::)

cos_dist

approximate cosine distance between two nn_strings (nmslib)

edit_dist

number of character edits between ed-strings

covered_pair

flag; 1 for pairs that made it through blocking, 0 otherwise; all 1s here

Return type:

pd.DataFrame

disallow_switched_pairs(df, incremental, nn_strings_to_query)[source]

Look through the columns nn_string_1 and nn_string_2 and keep only rows where nn_string_1 <= nn_string_2, to prevent duplicate pairs downstream (i.e. of ABBY->ZABBY and ZABBY->ABBY, only one is needed). Special case for incremental runs.

Parameters:
  • df (pd.DataFrame) – holds similarity and commonness info about pairs of names

  • incremental (bool) – True if the current run is incremental

  • nn_strings_to_query (list) – nn_strings that are in “to query” list

Returns:

same as input df, but no AB/BA duplicates

Return type:

pd.DataFrame

get_actual_candidates(near_neighbors_df, nn_string_expanded_df, nn_strings_to_query, thresholds, incremental, output=None)[source]

Actually determines whether two names become candidates; this function is launched by generate_candidate_pairs() and run on individual worker threads to speed up processing.

Parameters:
  • near_neighbors_df (pd.DataFrame) –

    holds similarity and commonness info about pairs of names

    nn_string_ix

    a string with nn_string_ix = i is the string located at nn_strings_queried_this_batch[i]

    nn_string_1

    concatenated version of nn-blocking columns for first element in pair (sep by ::)

    nn_string_2

    concatenated version of nn-blocking columns for second element in pair (sep by ::)

    cos_dist

    approximate cosine distance between two nn_strings (nmslib)

    commonness_penalty_1

    penalty for last-name commonness for first element in pair

    commonness_penalty_2

    penalty for last-name commonness for second element in pair

  • nn_string_expanded_df (pandas dataframe) – table at nn_string/ed_string/absval_string level (expanded if split_name is True)

  • nn_strings_to_query (list) – nn_strings in the “to query” list (needed for incremental check)

  • thresholds (dict) – information about what blocking distances are allowed

  • incremental (bool) – True if the current run is incremental

  • output – None if the output should be returned, rather than written

Returns:

chunk of the candidate-pairs list

blockstring_1

concatenated version of blocking columns for first element in pair (sep by ::)

blockstring_2

concatenated version of blocking columns for second element in pair (sep by ::)

cos_dist

approximate cosine distance between two nn_strings (nmslib)

edit_dist

number of character edits between ed-strings

covered_pair

flag; 1 for pairs that made it through blocking, 0 otherwise; all 1s here

Return type:

pd.DataFrame

get_near_neighbors_df(near_neighbors_list, nn_string_info, nn_strings_this_index, nn_strings_queried_this_batch)[source]

For a small batch of names (nn_strings_queried_this_batch), format a dataframe that enumerates every pair of (name in this batch, a near neighbor), along with information about similarity and commonness.

Parameters:
  • near_neighbors_list (list) – list of (list of k IDs, list of k distances) tuples, of length batch_size

  • nn_string_info (pd.DataFrame) – table mapping nn_string to commonness_penalty

  • nn_strings_this_index (list) – nn_strings in the current index

  • nn_strings_queried_this_batch (list) – nn_strings in the current query batch (length batch_size), whose neighbors are stored in near_neighbors_list

Returns:

holds similarity and commonness info about pairs of names

nn_string_ix

a string with nn_string_ix = i is the string located at nn_strings_queried_this_batch[i]

nn_string_1

concatenated version of nn-blocking columns for first element in pair (sep by ::)

nn_string_2

concatenated version of nn-blocking columns for second element in pair (sep by ::)

cos_dist

approximate cosine distance between two nn_strings (nmslib)

commonness_penalty_1

penalty for last-name commonness for first element in pair

commonness_penalty_2

penalty for last-name commonness for second element in pair

Return type:

pd.DataFrame

get_exact_match_candidate_pairs(nn_string_info_multi, nn_string_expanded_df, blocking_thresholds)[source]

All nn_strings that appear more than once need a corresponding (nn_string, nn_string) candidate pair – for this type of candidate pair, the near-neighbor “approximation” can easily be skipped.

Parameters:
  • nn_string_info_multi (pd.DataFrame) – nn_string_info, subset to nn_strings with n_new > 0 & n_total > 1

  • nn_string_expanded_df (pd.DataFrame) – table at nn_string/ed_string/absval_string level (expanded if split_name is True)

  • blocking_thresholds (dict) – dictionary with thresholds for blocking, e.g. high and low bar

Returns:

portion of the candidate pairs list (where nn_string_1 == nn_string_2)

nn_string

concatenated version of nn-blocking columns (sep by ::)

commonness_penalty

float indicating how common the last name is

n_new

number of times this nn_string appears in a “new” record

n_existing

number of times this nn_string appears in an “existing” record

n_total

number of times this nn_string appears in any record

Return type:

pd.DataFrame

namematch.block.get_blocking_columns(blocking_scheme)[source]

Get the list of blocking variables for each type of blocking:

Parameters:

blocking_scheme (dict) – dictionary with info on how to do blocking

Returns:

the variable names needed for each type of blocking

Return type:

list of string lists

namematch.block.read_an(an_file, nn_cols, ed_col, absval_col)[source]

Read in relevant columns for blocking from the all-names file.

Parameters:
  • an_file (str) – path to the all-names file

  • nn_cols (list of strings) – variables for near neighbor blocking

  • ed_col (list of strings) – variables for edit-distance blocking

  • absval_col (list of strings) – variables for absolute-value blocking

Returns:

all-names dataframe, relevant columns only (where drop_from_nm == 0)

record_id

unique record identifier

blockstring

concatenation of all the blocking variables (sep by ::)

file_type

either “new” or “existing”

drop_from_nm

flag, 1 if met any “to drop” criteria 0 otherwise

<nn-blocking column(s)>

variables for near-neighbor blocking

<ed-blocking column>

variable for edit-distance blocking

<av-blocking column>

(optional) variable for abs-value blocking

nn_string

concatenated version of nn-blocking columns (sep by ::)

ed_string

copy of ed-blocking column

absval_string

copy of abs-value-blocking column

Return type:

pd.DataFrame

namematch.block.get_nn_string_counts(an)[source]

Count number of records per nn_strings (per file_type).

Parameters:

an (pd.DataFrame) –

all-names table, relevant columns only (where drop_from_nm == 0)

record_id

unique record identifier

blockstring

concatenation of all the blocking variables (sep by ::)

file_type

either “new” or “existing”

drop_from_nm

flag, 1 if met any “to drop” criteria 0 otherwise

<nn-blocking column(s)>

variables for near-neighbor blocking

<ed-blocking column>

variable for edit-distance blocking

<av-blocking column>

(optional) variable for abs-value blocking

nn_string

concatenated version of nn-blocking columns (sep by ::)

ed_string

copy of ed-blocking column

absval_string

copy of abs-value-blocking column

Returns:

two keys (new and existing), mapping to a dictionary of nn_strings to n_records

Return type:

dict

namematch.block.get_common_name_penalties(clean_last_names, max_penalty, num_threshold_bins=1000)[source]

Create a dictionary mapping each last name to a “commonness penalty.” Two SMITHs are less likely to be the same person than two HANDAs, since SMITH is such a common name. This function quantifies this penalty for use in later blocking calculations. A more common name receives a higher number, topping out at max_penalty.

Parameters:
  • clean_last_names (pd.Series) – clean (un-split) last name column (one row per record)

  • max_penalty (float) – the maximum penalty (for the most common names)

  • num_threshold_bins (int) – number of different categories of commonness to create

Returns:

dictionary mapping name (str) to penalty (float)

Return type:

dict
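
Example – a sketch of the commonness-penalty computation; the max_penalty value is illustrative.

    import pandas as pd
    from namematch.block import get_common_name_penalties

    last_names = pd.Series(["SMITH", "SMITH", "SMITH", "JONES", "HANDA"])

    # Maps each last name to a penalty capped at max_penalty; more common
    # names receive larger penalties.
    penalties = get_common_name_penalties(last_names, max_penalty=0.1)
    assert penalties["SMITH"] >= penalties["HANDA"]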

namematch.block.get_all_shingles()[source]

Get all valid 2-shingles.

Returns:

valid 2-shingles

Return type:

list

namematch.block.prep_index()[source]

Initialize index data structure, which will store similarity information about the names, and load processed shingles into it.

Returns:

nmslib index object (pre time-consuming build call)

Return type:

nmslib.FloatIndex

namematch.block.get_second_index_nn_strings(all_nn_strings, main_nn_strings)[source]

Get nn_strings that haven’t already been stored in the main index.

Parameters:
  • all_nn_strings (list) – list of all nn_strings in the data (expanded if split_names is True)

  • main_nn_strings (list) – list of nn_strings already in the main index

Returns:

the nn_strings that are not in main_nn_strings

Return type:

list
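
Example – a sketch of selecting the names that still need to be indexed during an incremental run (nn_strings use the ‘FIRST::LAST’ form described above).

    from namematch.block import get_second_index_nn_strings

    all_nn_strings = ["JOHN::SMITH", "JANE::DOE", "SAM::BROWN"]
    main_nn_strings = ["JOHN::SMITH", "JANE::DOE"]

    # Only the names not already in the main index need a secondary index.
    second_index_nn_strings = get_second_index_nn_strings(all_nn_strings, main_nn_strings)
    # e.g. ["SAM::BROWN"]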

namematch.block.save_main_index(main_index, main_index_nn_strings, main_index_file)[source]

Save the main nmslib index and pickle dump the associated nn_strings list.

Parameters:
  • main_index (nmslib.FloatIndex) – the main, built nmslib index

  • main_index_nn_strings (list) – list of nn_strings in the main index

  • main_index_file (str) – path to store the main nmslib index

namematch.block.load_main_index_nn_strings(og_blocking_index_file)[source]

Load the nn_strings that are in an existing nmslib index file.

Parameters:

og_blocking_index_file (str) – path to original blocking index

Returns:

loaded list of nn_strings in an existing nmslib index

Return type:

list

namematch.block.write_some_cps(cand_pairs, candidate_pairs_file)[source]

Write out a portion of the candidate-pairs to parquet.

Parameters:
  • cand_pairs (pd.DataFrame) – chunk of the candidate-pairs file

  • candidate_pairs_file (str) – path to the candidate-pairs file

namematch.block.generate_true_pairs(must_links_df)[source]

Reduce the must-link record pairs to must-link blockstring pairs.

Parameters:

must_links_df (pd.DataFrame) –

list of must-link record pairs

record_id_1

unique identifier for the first record in the pair

record_id_2

unique identifier for the second record in the pair

blockstring_1

blockstring for the first record in the pair

blockstring_2

blockstring for the second record in the pair

drop_from_nm_1

flag, 1 if the first record in the pair was not eligible for matching

drop_from_nm_2

flag, 1 if the second record in the pair was not eligible for matching

existing

flag, 1 if the pair is must-link because of ExistingID

Returns:

list of must-link blockstring pairs (where both records have drop_from_nm == 0)

blockstring_1

blockstring for the first record in the pair

blockstring_2

blockstring for the second record in the pair

Return type:

pd.DataFrame
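
Example – a sketch of collapsing record-level must links to blockstring pairs. The column values are placeholders in the documented format (blockstrings are the blocking variables joined by ::).

    import pandas as pd
    from namematch.block import generate_true_pairs

    must_links_df = pd.DataFrame({
        "record_id_1": ["a-1"], "record_id_2": ["b-7"],
        "blockstring_1": ["JOHN::SMITH::1990-01-01"],
        "blockstring_2": ["JON::SMITH::1990-01-01"],
        "drop_from_nm_1": [0], "drop_from_nm_2": [0],
        "existing": [0],
    })

    # Keeps only pairs where both records have drop_from_nm == 0 and
    # reduces them to blockstring_1/blockstring_2 pairs.
    true_pairs_df = generate_true_pairs(must_links_df)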

namematch.generate_data_rows

class namematch.generate_data_rows.GenerateDataRows(params, schema, output_dir, all_names_file, candidate_pairs_file, *args, **kwargs)[source]

Bases: NamematchBase

property output_files
main(**kw)[source]

Take candidate pairs and merge on the all-names records (twice) to get a dataset at the record pair level. Compute distance metrics between the records in the pair – these are the features for modeling.

Parameters:
  • params (Parameters object) – contains parameter values

  • schema (Schema object) – contains match schema info (files to match, variables to use, etc.)

  • all_names_file (str) – path to the all-names file

  • candidate_pairs_file (str) – path to the candidate-pairs file

  • output_dir (str) – path to the data-rows dir

generate_name_probabilities_object(an, fn_col=None, ln_col=None, **kw)[source]

The generate_name_probabilities function takes a list of names (from the all-names file) and creates an object containing queryable probability information for each name.

Parameters:
  • an (pd.DataFrame) – all-names, just the name columns

  • fn_col (str) – name of first name column

  • ln_col (str) – name of last name column

Returns:

name probability object

find_valid_training_records(an, an_match_criteria, **kw)[source]
generate_actual_data_rows(params, schema, sbs_df, np_object, first_iter)[source]

Create modeling dataframe by comparing each variable (via numerous distance metrics).

Parameters:
  • params (Parameters object) – contains matching parameters

  • schema (Schema object) – contains matching schema (data files and variables)

  • sbs_df (pd.DataFrame) –

    side-by-side table (record pair level, with info from both an records)

    record_id (_1, _2)

    unique record identifier

    blockstring (_1, _2)

    concatenated version of blocking columns (sep by ::)

    file_type (_1, _2)

    either “new” or “existing”

    candidate_pair_ix

    index from candidate-pairs list

    covered_pair

    flag, 1 if blockstring pair passed blocking 0 otherwise

    <fields for matching> (_1, _2)

    both for the matching model and for constraint checking

  • np_object (nm_prob.NameProbability object) – contains information about name probabilities

Returns:

chunk of the data-rows file

dr_id

unique record pair identifier (record_id_1__record_id_2)

record_id (_1, _2)

unique record identifiers

<distance metrics>

how similar the different matching fields are between records

label

”1” if the records refer to the same person, “0” if not, “” otherwise

Return type:

pd.DataFrame

generate_data_row_files(params, schema, an, cp_df, name_probs, start_ix_worker, end_ix_worker, dr_file, **kw)[source]

The generate_data_row_files function is run in parallel to generate the data needed for the random forest; it performs the merge between candidate pairs and all-names and calls the function that calculates distance metrics.

Parameters:
  • params (Parameters object) – contains matching parameters

  • schema (Schema object) – contains matching schema (data files and variables)

  • an (pd.DataFrame) –

    all-names table (one row per input record)

    record_id

    unique record identifier

    file_type

    either “new” or “existing”

    <fields for matching>

    both for the matching model and for constraint checking

    <raw name fields>

    pre-cleaning version of first and last name

    blockstring

    concatenated version of blocking columns (sep by ::)

    drop_from_nm

    flag, 1 if met any “to drop” criteria 0 otherwise

  • cp_df (pd.DataFrame) –

    candidate-pairs list

    blockstring_1

    concatenated version of blocking columns for first element in pair (sep by ::)

    blockstring_2

    concatenated version of blocking columns for second element in pair (sep by ::)

    covered_pair

    flag; 1 for pairs that made it through blocking, 0 otherwise; all 1s here

  • name_probs (nm_prob.NameProbability object) – contains information about name probabilities

  • start_ix_worker (int) – starting index of the candidate-pairs chunk to read in this thread

  • end_ix_worker (int) – end index of the candidate-pairs chunk to read in this thread

  • dr_file (str) – path to data-rows file to write (one for each worker thread)

namematch.fit_model

class namematch.fit_model.FitModel(params, all_names_file, data_rows_dir, model_info_file, output_dir, trained_model_info_file='None', selection_model_name='basic_selection_model.pkl', match_model_name='basic_match_model.pkl', flipped0_file=None, *args, **kwargs)[source]

Bases: NamematchBase

Parameters:
  • params (Parameters object) – contains parameter values

  • all_names_file (str) – path to the all-names file

  • data_rows_dir (str) – path to the data-rows dir

  • model_info_file (str) – path to the model info yaml file

  • output_dir (str) – path to the model dir

  • trained_model_info_file (str) – path to the model info yaml file of a previously trained model

  • selection_model_name (str) – selection model name

  • match_model_name (str) – match model name

  • flipped0_file (str) – flipped0 file path

property output_files
property dr_file_list
main(**kw)[source]

Train and evaluate random forest model(s). Depending on the settings, this might involve training and evaluating multiple types of models (e.g. selection and match models) and/or models for different data-row types (e.g. basic and no-dob).

fit_model(df, vars_to_exclude, outcome, weights=None, n_jobs=1, **kw)[source]

Fit random forest model.

Parameters:
  • df (pd.DataFrame) – data rows, subset to training rows

  • vars_to_exclude (list) – variables to disallow from the model

  • outcome (string) – name of the column that we’re predicting

  • weights (list) – sample weights to use for training (can be None)

  • n_jobs (int) – number of jobs to run in parallel

Returns:

tuple containing:

  • mod (sklearn.ensemble.RandomForestClassifier): trained sklearn random forest model object

  • feature_info (pd.DataFrame): feature importance information

Return type:

tuple

fit_models(train_df, model_type, model_info)[source]

Fit random forest model.

Parameters:
  • train_df (pd.DataFrame) – data rows, subset to training rows

  • model_type (string) – either “selection” or “match”

  • model_info (dict) – dict with information about how to fit the model

Returns:

maps model name (e.g. basic or no_dob) to a trained model object

Return type:

dict

evaluate_models(phats_df, outcome, model_type, weight_using_selection_model=False, default_threshold=0.5, missingness_model_threshold_boost=0.2, optimize_threshold=False, fscore_beta=1.0, **kw)[source]
get_train_eval_data(an_train_eligible_dict, model_info, params, model_type, any_train=True)[source]

Load data-rows, filter to rows that are eligible for training a given model type, and then split the data into a training set and a labeled evaluation set.

Parameters:
  • an_train_eligible_dict (dict) – maps record_id to flag indicating record’s all-names based training eligibility

  • model_info (dict) – dict with information about how to fit the model

  • params (Parameters object) – contains parameter values

  • model_type (str) – either “selection” or “match”

  • any_train (bool) – True if you want training data (e.g. not a pre-trained model), False otherwise

Returns:

tuple containing:

  • pd.DataFrame: data rows, filtered to training data (excluding labeled eval data)

  • pd.DataFrame: data rows, filtered to labeled eval data

  • float: share of data rows that are labeled

Return type:

tuple

find_valid_training_records(an, an_match_criteria)[source]

Identify records that meet the all-names criteria for training data.

Parameters:
  • an (pd.DataFrame) –

    all-names table (one row per input record)

    record_id

    unique record identifier

    file_type

    either “new” or “existing”

    <fields for matching>

    both for the matching model and for constraint checking

    <raw name fields>

    pre-cleaning version of first and last name

    blockstring

    concatenated version of blocking columns (sep by ::)

    drop_from_nm

    flag, 1 if met any “to drop” criteria 0 otherwise

  • an_match_criteria (dict) – keys are all-names columns, mapped to acceptable values

Returns:

flag, 1 if the record is eligible for the training set, 0 otherwise

Return type:

pd.Series

namematch.fit_model.get_feature_info(pipeline, raw_num_cols, raw_cat_cols)[source]

Extract the feature importance information from a sklearn model pipeline.

Parameters:
  • pipeline (sklearn fitted pipeline) – trained model

  • raw_num_cols (list) – numeric columns that went into the model (before pipeline processing)

  • raw_cat_cols (list) – categorical columns that went into the model (before pipeline processing)

Returns:

feature importance information

feature

name of the feature

importance

relative importance of this feature to the model

Return type:

pd.DataFrame

namematch.fit_model.save_models(selection_models, match_models, model_info)[source]

Save the models to file.

Parameters:
  • selection_models (dict) – maps model name (e.g. basic or no-dob) to a fit selection model object

  • match_models (dict) – maps model name (e.g. basic or no-dob) to a fit match model object

  • model_info (dict) – dict with information about how to fit the model

namematch.fit_model.define_necessary_models(dr_file_list, output_dir, missing_mod_field=None, selection_model_name='basic_selection_model.pkl', match_model_name='basic_match_model.pkl')[source]

Determine the different models needed (using a sample) and define the characteristics of data that determine which model should handle it.

NOTE: Right now, there is an assumption that the training universe is the same between all models (i.e. basic and missingness).

Parameters:
  • dr_file_list (list) – list of paths to all data row files

  • output_dir (str) – model output folder path

  • missing_mod_field (str or None) – field that could trigger need for separate model

Returns:

mapping the name of a model (str) to a dict of the following information:

  • selection_model_name (str)

  • match_model_name (str)

  • type (str): one of “default” or “missingness”

  • actual_phat_universe (dict): maps a variable name to a value(?)

  • vars_to_exclude (str list)

  • match_thresh (float): threshold for match/nonmatch

Return type:

dict
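
Example – a sketch of defining the needed models; the file paths and the missing_mod_field value are illustrative.

    from namematch.fit_model import define_necessary_models

    # Paths to the data-row files written by the GenerateDataRows task (placeholder path).
    dr_file_list = ["output/details/data_rows/data_rows_0.parquet"]

    model_info = define_necessary_models(
        dr_file_list,
        output_dir="output/details/model",
        missing_mod_field="dob",  # field whose missingness may require a separate model
    )
    for name, info in model_info.items():
        print(name, info["match_model_name"], info["vars_to_exclude"])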

namematch.fit_model.load_and_save_trained_model(trained_model_info_file, output_file)[source]

Load a set of pre-trained models and copy them to the current run’s output directory. Typically only used in incremental runs.

Parameters:
  • trained_model_info_file (str) – path to a model yaml file, which has path/threshold/universe info

  • output_file (str) – path to output the current run’s model yaml file (for copying)

Returns:

maps model name (e.g. basic or no_dob) to a trained model object

Return type:

dict

namematch.fit_model.get_match_train_eligible_flag(df, dr_train_eligible_conditions_dict, an_train_eligible_dict)[source]

Determine if a data-row is eligible for training (for match models), according to both all-names eligibility criteria and data-row eligibility criteria.

Parameters:
  • df (pd.DataFrame) – portion of data-rows file, limited to labeled rows

  • dr_train_eligible_conditions_dict (dict) – contains data-row training eligibility criteria

  • an_train_eligible_dict (dict) – maps record_id to flag indicating record’s all-names based training eligibility

Returns:

flag, 1 if data-row is training eligible (for match models)

Return type:

pd.Series

namematch.fit_model.add_threshold_dict(model_info, thresholds_dict)[source]

Add threshold information to the model_info dict, once it’s been determined.

Parameters:
  • model_info (dict) – dict with information about how to fit the model

  • thresholds_dict (dict) – keys are model name (e.g. basic, no-dob), values are optimized thresholds

Returns:

model dict, now with threshold info

Return type:

dict

namematch.fit_model.get_flipped0_potential_edges(phats_df, model_info, allow_clusters_w_multiple_unique_ids)[source]

If allowed, identify the set of labeled 0s with high phats so they can be treated as matches downstream.

Parameters:
  • phats_df (pd.DataFrame) –

    phat info for record pairs

    record_id (_1, _2)

    unique record identifiers

    model_to_use

    based on pair characteristics, which model to use (e.g. basic or no-dob)

    covered_pair

    did the pair make it through blocking

    match_train_eligible

    is the pair eligible for training (for match model)

    exactmatch

    is the pair an exact match on name/dob

    label

    whether the pair is a match or not

    <phat_col>

    predicted probability of match

  • model_info (dict) – dict with information about how to fit the model

  • allow_clusters_w_multiple_unique_ids (bool) – param controlling if 0s can be flipped to 1

Returns:

same as phats_df, limited to the labeled 0s with high phats (the flipped potential edges)

Return type:

pd.DataFrame

namematch.predict

class namematch.predict.Predict(params, data_rows_dir, model_info_file, output_dir, *args, **kwargs)[source]

Bases: NamematchBase

Parameters:
  • params (Parameters object) – contains parameter values

  • model_info_file (str) – path to the model info yaml file for a trained model

  • data_rows_dir (str) – path to the data-rows dir

  • output_dir (str) – path to the potential-links dir

property output_files
property dr_file_list
main(**kw)[source]

Read in data-rows and predict (in parallel) for each unlabeled pair. Output the pairs above the threshold.

get_potential_edges(dr_file, match_models, model_info, output_dir, params, **kw)[source]

Read in data rows in chunks and predict as needed. Write (append) the edges above the threshold to the appropriate file.

Parameters:
  • dr_file (string) – path to data file to predict for

  • match_models (dict) – maps model name (e.g. basic or no-dob) to a fit match model object

  • model_info (dict) – contains information about threshold

  • output_dir (str) – directory to place potential links

  • params (Parameters obj) – contains parameter values (i.e. use_uncovered_phats)

get_potential_edges_in_parallel(match_models, model_info, output_dir, params)[source]

Dispatch the worker threads that will predict for unlabeled pairs in parallel.

Parameters:
  • match_models (dict) – maps model name (e.g. basic or no-dob) to a fit match model object

  • model_info (dict) – dict with information about how to fit the model

  • output_dir

  • params (Parameters object) – contains parameter values

classmethod predict(models, df, model_type, oob=False, all_cols=False, all_models=True, prob_match_train=None)[source]

Use the trained models to predict for pairs of records.

Parameters:
  • models (dict) – maps model name (e.g. basic or no-dob) to a fit match model object

  • df (pd.DataFrame) – portion of the data-rows table, with a “model_to_use” column appended

  • model_type (str) – model type (e.g. selection or match)

  • oob (bool) – if True, use the out-of-bag predictions

  • all_cols (bool) – if True, keep all columns in the output df; not just the relevant ones

  • all_models (bool) – if True, predict for each row using all models, not just the “model to use”

  • prob_match_train (float) – share of data-rows that are labeled
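
Example – a sketch of the classmethod; match_models and dr_chunk are assumed to already exist (a dict of fit models keyed by model name, and a chunk of the data-rows table with a model_to_use column).

    from namematch.predict import Predict

    preds_df = Predict.predict(
        models=match_models,
        df=dr_chunk,
        model_type="match",
        oob=False,         # use regular (not out-of-bag) predictions
        all_models=False,  # predict with each row's "model_to_use" only
    )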

namematch.cluster

class namematch.cluster.Constraints[source]

Bases: object

property get_columns_used
property is_valid_cluster
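
Example – a heavily hedged sketch of user-defined constraints. Name Match accepts either a Constraints object or a path to a python script defining constraint functions (see Cluster below); the hook names mirror the two properties above, but the exact signatures (e.g. whether is_valid_cluster receives the cluster’s all-names rows as a DataFrame, and whether get_columns_used returns a list) are assumptions to check against the user documentation.

    # constraints.py -- passed via NameMatcher(constraints="constraints.py")

    def get_columns_used():
        # All-names columns the custom check needs (assumed return format).
        return ["dob"]

    def is_valid_cluster(cluster):
        # Example rule: reject a proposed cluster with more than two distinct
        # dates of birth (cluster assumed to be the cluster's all-names rows).
        return cluster["dob"].nunique(dropna=True) <= 2
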
class namematch.cluster.Cluster(params, schema, must_links_file='must_links.parquet', potential_edges_dir='potential_links', flipped0_edges_file='flipped0_potential_links.csv', all_names_file='all_names.parquet', cluster_assignments='cluster_assignments.pkl', edges_to_cluster='edges_to_cluster.parquet', constraints: str | Constraints | None = None, *args, **kwargs)[source]

Bases: NamematchBase

Parameters:
  • params (Parameters object) – contains parameter values

  • schema (Schema object) – contains match schema info (files to match, variables to use, etc.)

  • constraints (str or Constraints object) – either a path to a python script defining constraint functions or a Constraints object

  • must_links_file (str) – path to the must-links file

  • potential_edges_dir (str) – path to the potential-links dir in the output/details folder

  • flipped0_edges_file (str) – path to the flipped-links file

  • all_names_file (str) – path to the all-names file

  • cluster_assignments (str) – path to the cluster-assignments file

property output_files
main(**kw)[source]

Read the record pairs with high probability of matching and connect them in a way that doesn’t violate any logic constraints to form clusters.

get_cluster_logic(constraints)[source]
auto_is_valid_edge(edges_df, uid_cols, allow_clusters_w_multiple_unique_ids, leven_thresh, eid_col=None)[source]

Check if two records would violate a unique id or existing id constraint.

Parameters:
  • edges_df (pd.DataFrame) – potential edges information, with columns:

    record_id_1 – unique record identifier (for first in pair)
    record_id_2 – unique record identifier (for second in pair)
    phat – predicted probability of a record pair being a match
    original_order – original ordering 1-N (useful so gt is always on top of phat=1 cases)

  • uid_cols (list) – all-names column(s) with compare_type UniqueID

  • allow_clusters_w_multiple_unique_ids (bool) – True if a cluster can have multiple uid values

  • leven_thresh (int) – number of character edits to allow between uids before they’re considered different

  • eid_col (str) – all-names column with compare_type ExistingID (None for non-incremental runs)

Returns:

potential edges information, but limited to rows that pass the automated validity check

Return type:

valid_edges_df

auto_is_valid_cluster(cluster, uid_cols, allow_clusters_w_multiple_unique_ids, leven_thresh, eid_col=None)[source]

Check if a proposed cluster would violate a unique id or existing id constraint.

Parameters:
  • cluster (pd.DataFrame) – all-names file (relevant columns only) records for the proposed cluster

  • uid_cols (list) – all-names column(s) with compare_type UniqueID

  • allow_clusters_w_multiple_unique_ids (bool) – True if a cluster can have multiple uid values

  • leven_thresh (int) – number of character edits to allow between uids before they’re considered different (illustrated in the sketch after this entry)

  • eid_col (str) – all-names column with compare_type ExistingID (None for non-incremental runs)

Returns:

False if an automated constraint is violated

Return type:

bool
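
To make the unique-id and leven_thresh logic concrete, the sketch below (a standalone illustration, not the library’s implementation) shows one way a proposed cluster’s UniqueID values could be validated: uid values within leven_thresh character edits of one another are treated as the same id.

    import pandas as pd

    def edit_distance(a: str, b: str) -> int:
        # plain dynamic-programming Levenshtein distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def uid_values_consistent(cluster: pd.DataFrame, uid_col: str, leven_thresh: int = 1) -> bool:
        # False if any two uid values in the cluster differ by more than leven_thresh edits
        uids = cluster[uid_col].dropna().unique().tolist()
        return all(edit_distance(u, v) <= leven_thresh
                   for i, u in enumerate(uids) for v in uids[i + 1:])

For example, a cluster whose uid values are 'A123' and 'A128' passes with leven_thresh=1, while one containing 'A123' and 'B987' does not.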

get_initial_clusters(must_links_df, an_df, eid_col, **kw)[source]

Use must links (ground truth and/or a previous run) to create the starting clusters.

Parameters:
  • must_links_df (pd.DataFrame) – record pairs that must be linked together no matter what, with columns:

    record_id_1 – unique identifier for the first record in the pair
    record_id_2 – unique identifier for the second record in the pair
    blockstring_1 – blockstring for the first record in the pair
    blockstring_2 – blockstring for the second record in the pair
    drop_from_nm_1 – flag, 1 if the first record in the pair was not eligible for matching
    drop_from_nm_2 – flag, 1 if the second record in the pair was not eligible for matching
    existing – flag, 1 if the pair is must-link because of ExistingID

  • an_df (pd.DataFrame) – all-names file, with only the columns relevant for clustering:

    record_id – unique record identifier
    <uid column(s)> – columns with compare_type UniqueID
    <eid column(s)> – columns with compare_type ExistingID
    <user-constraint column(s)> – (optional) columns mentioned in get_columns_used()

  • eid_col (str) – all-names column with compare_type ExistingID, or None

Returns:

clusters (dict) – maps a cluster id to a list of record ids
cluster_assignments (dict) – maps a record_id to a cluster_id
original_cluster_ids (set) – cluster ids that are already in use (only for incremental runs)

Return type:

tuple of (dict, dict, set)

save_df_to_disk(df)[source]
get_potential_edges(potential_edges_files, flipped0_edges_file, gt_1s_df, cluster_logic, cluster_info, uid_cols, eid_col, **kw)[source]

Use the prediction files to build the list of edges that the constrained clustering algorithm should try to add.

Parameters:
  • potential_edges_files (list) – paths to the potential links files

  • flipped0_edges_file (str) – path to the flipped0-links file

  • gt_1s_df (pd.DataFrame) – known y=1 pairs; will be matched, pending edge/cluster validity checks

  • cluster_logic (module) – user-defined constraint functions

  • cluster_info (pd.DataFrame) – all-names file, with only the columns relevant for clustering

  • uid_cols (list) – all-name columns with compare_type UniqueID

  • eid_col (str) – all-name column with compare_type ExistingID

load_cluster_info(all_names_file, uid_cols, eid_col, cluster_logic, **kw)[source]

Read in the all_names information needed for cluster constraint checking. Columns defined in the config as compare_type UniqueID or ExistingID are loaded automatically (as strings, with missing values represented as NA). Any other columns you want loaded should be listed in the user-defined get_columns_used() function.

Parameters:
  • all_names_file (str) – path to the all-names file

  • uid_cols (list) – all-name columns with compare_type UniqueID

  • eid_col (str) – all-name column with compare_type ExistingID

  • cluster_logic (module) – user-defined constraint functions

Returns:

all-names file, with only the columns relevant for clustering:

  record_id – unique record identifier
  <uid column(s)> – columns with compare_type UniqueID
  <eid column(s)> – columns with compare_type ExistingID
  <user-constraint column(s)> – (optional) columns mentioned in get_columns_used()

Return type:

pd.DataFrame

get_ci_ix_map(cluster_info)[source]
cluster_potential_edges(clusters, cluster_assignments, original_cluster_ids, cluster_info, cluster_logic, uid_cols, eid_col, **kw)[source]

Grow clusters by adding potential edges to the cluster graph in order of importance, skipping those that would cause constraint violations (see the sketch at the end of this entry).

Parameters:
  • clusters (dict) – maps a cluster id to a list of record ids – post initialization

  • cluster_assignments (dict) – maps a record_id to a cluster_id – post initialization

  • original_cluster_ids (set) – cluster ids that are already in use (only for incremental runs)

  • cluster_info (pd.DataFrame) – all-names file, with only the columns relevant for clustering:

    record_id – unique record identifier
    <uid column(s)> – columns with compare_type UniqueID
    <eid column(s)> – columns with compare_type ExistingID
    <user-constraint column(s)> – (optional) columns mentioned in get_columns_used()

  • potential_edges (deque) – each element is a dict version of a potential edge’s record

  • cluster_logic (module) – user-defined constraint functions

  • uid_cols (list) – all-name columns with compare_type UniqueID

  • eid_col (str) – all-name column with compare_type ExistingID

Returns:

maps record_id to cluster_id

Return type:

dict
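
Conceptually, this step is a greedy, constraint-checked merge: edges are considered in order of importance, and an edge is skipped if merging the two clusters it connects would violate a constraint. The toy sketch below illustrates that core loop; it is not the library’s implementation (which also performs edge-level checks, handles flipped-0 edges, and preserves incremental cluster ids).

    def greedy_constrained_clustering(initial_clusters, potential_edges, is_valid_cluster):
        # initial_clusters: dict mapping cluster_id -> list of record_ids
        #   (assumes every record appears here, e.g. as a singleton cluster)
        # potential_edges: iterable of (record_id_1, record_id_2), sorted by importance
        # is_valid_cluster: callable returning False if a proposed merged cluster is invalid
        cluster_assignments = {r: cid for cid, recs in initial_clusters.items() for r in recs}
        clusters = {cid: list(recs) for cid, recs in initial_clusters.items()}

        for rec1, rec2 in potential_edges:
            cid1, cid2 = cluster_assignments[rec1], cluster_assignments[rec2]
            if cid1 == cid2:
                continue                      # already in the same cluster
            merged = clusters[cid1] + clusters[cid2]
            if not is_valid_cluster(merged):
                continue                      # skip edges that cause violations
            clusters[cid1] = merged
            for r in clusters.pop(cid2):
                cluster_assignments[r] = cid1

        return cluster_assignments            # maps record_id -> cluster_id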

namematch.generate_output

class namematch.generate_output.GenerateOutput(params, schema, all_names_file, cluster_assignments_file, an_output_file, output_dir, output_file_uuid=None, *args, **kwargs)[source]

Bases: NamematchBase

Parameters:
  • params (Parameters object) – contains parameter values

  • schema (Schema object) – contains match schema info (files to match, variables to use, etc.)

  • all_names_file (str) – path to the all-names file

  • cluster_assignments_file (str) – path to the cluster-assignments file

  • an_output_file (str) – path to the all-names-with-clusterid file

  • output_dir (str) – path to final output directory

property output_files
main(**kw)[source]

Read in the cluster assignments dictionary and use it to create all-names-with-cluster-id and the “with-cluster-id” versions of the input datasets.

create_allnames_clusterid_file(all_names_file, cluster_assignments, cleaned_col_names, **kw)[source]

Create all-names-with-clusterid dataframe.

Parameters:
  • all_names_file (str) – path to the all-names file

  • cluster_assignments (dict) – maps record_id to cluster_id

  • cleaned_col_names (list) – all-name columns used in cosine blocking

Returns:

all-names-with-cluster-id table, with columns:

  record_id – unique record identifier
  file_type – either “new” or “existing”
  <fields for matching> – both for the matching model and for constraint checking
  blockstring – concatenated version of blocking columns (separated by ::)
  drop_from_nm – flag, 1 if met any “to drop” criteria, 0 otherwise
  cluster_id – unique person identifier, no missing values

Return type:

pd.DataFrame

output_clusterid_files(data_files, cluster_assignments, output_dir, output_file_uuid=None, **kw)[source]

For each input file, construct a matching output file that has the cluster_id column, and write it.

Parameters:
  • data_files (list of DataFile objects) – contains info about each input file

  • cluster_assignments (dict) – maps record_id to cluster_id

  • output_dir (str) – the path that was supplied when the name match object was created
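
A quick, hypothetical way to inspect the final output is to load the all-names-with-clusterid file with pandas and look at cluster sizes (the path below assumes the default output directory and file name; adjust to your configuration):

    import pandas as pd

    # hypothetical path -- depends on your output_dir and an_output_file settings
    an = pd.read_csv('output/all_names_with_clusterid.csv', dtype=str)

    # cluster_id is the unique person identifier; count records per cluster
    cluster_sizes = an.groupby('cluster_id')['record_id'].count().sort_values(ascending=False)
    print(cluster_sizes.head())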

namematch.utils.utils

class namematch.utils.utils.StatLogFilter[source]

Bases: object

filter(logRecord)[source]
namematch.utils.utils.setup_logging(log_params, log_filepath, output_temp_dir, filter_stats=False, logging_level='INFO')[source]

Setup logging configuration.

Parameters:
  • log_params (dict) – contains info for logging setup

  • log_filepath (str) – path to store logs

namematch.utils.utils.log_stat(human_desc, yaml_desc, value)[source]

Log a statistic in the log and in the stats yaml.

Parameters:
  • human_desc (str) – human readable description of the stat (could be a phrase)

  • yaml_desc (str) – concise yaml-key compatible description of the stat

  • value (float or str) – value of the stat
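
For example (the description strings and value are made up):

    from namematch.utils.utils import log_stat

    # human-readable phrase, yaml-key-friendly name, and the value itself
    log_stat('Number of potential edges above threshold', 'n_potential_edges', 48210)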

namematch.utils.utils.log_runtime_and_memory(method)[source]

Decorator that logs time to execute functions and records max memory usage in GB.

Parameters:

method (function) – function to measure/log runtime and memory usage

Returns:

value returned by the function being decorated

namematch.utils.utils.load_yaml(yaml_file)[source]

Load a yaml file into a dictionary.

Parameters:

yaml_file (str) – path to yaml file

Returns:

dictionary version of input yaml file

Return type:

dict

namematch.utils.utils.dump_yaml(dict_to_write, yaml_file)[source]

Write a dictionary into a yaml file.

Parameters:
  • dict_to_write (dict) – dict to write to yaml

  • yaml_file (str) – path to output yaml file
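
A small round-trip example (the dictionary contents are made up):

    from namematch.utils.utils import dump_yaml, load_yaml

    d = {'verbose': True, 'num_workers': 4}   # made-up contents
    dump_yaml(d, 'example.yaml')              # write the dict to yaml
    assert load_yaml('example.yaml') == d     # read it back into a dict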

namematch.utils.utils.to_dict(obj)[source]

Convert an object (e.g. an instance of a user-defined class) into a dictionary to make writing easier.

Parameters:

obj (object) – class instance to convert to dict

namematch.utils.utils.create_nm_record_id(nickname, record_id_series)[source]
namematch.utils.utils.clean_nn_string(n)[source]

Remove JR, SR, II, extra spaces, etc. from near-neighbor (nn) strings. The original string in the dataframe keeps its punctuation and suffixes.

Parameters:

n (str) – raw name value

Returns:

clean version of the input name

Return type:

str
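
For example (expected behavior inferred from the description above):

    from namematch.utils.utils import clean_nn_string

    clean_nn_string('SMITH  JR')   # expected to yield 'SMITH' (suffix and extra spaces removed)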

namematch.utils.utils.build_blockstring(df, blocking_scheme, incl_ed_string=True)[source]

Create blockstrings (values for blocking separated by ::, such as JOHN::SMITH::1993-07-23) from all-names data.

Parameters:
  • df (pd.DataFrame) – all-names table, with columns:

    record_id – unique record identifier
    file_type – either “new” or “existing”
    <fields for matching> – both for the matching model and for constraint checking
    drop_from_nm – flag, 1 if met any “to drop” criteria, 0 otherwise

  • blocking_scheme (dict) – contains info about fields to block on

  • incl_ed_string (bool) – True if the blockstring should end with the edit-distance string (e.g. dob)

Returns:

blockstrings

Return type:

pd.Series

namematch.utils.utils.get_nn_string_from_blockstring(blockstring)[source]

Parse out the near-neighbor string (e.g. first-name and last-name) from a blockstring.

Parameters:

blockstring (str) – string with info for blocking (e.g. JOHN::SMITH::1993-07-23)

Returns:

near-neighbor string (e.g. JOHN::SMITH)

Return type:

str

namematch.utils.utils.get_ed_string_from_blockstring(blockstring)[source]

Parse out the edit-distance string (e.g. dob) from a blockstring.

Parameters:

blockstring (str) – string with info for blocking (e.g. JOHN::SMITH::1993-07-23)

Returns:

edit-distance string (e.g. 1993-07-23)

Return type:

str
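
Using the blockstring example from the descriptions above:

    from namematch.utils.utils import (
        get_nn_string_from_blockstring,
        get_ed_string_from_blockstring,
    )

    bs = 'JOHN::SMITH::1993-07-23'
    get_nn_string_from_blockstring(bs)   # 'JOHN::SMITH'
    get_ed_string_from_blockstring(bs)   # '1993-07-23'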

namematch.utils.utils.get_endpoints(n, num_chunks)[source]

Divide a number into some number of chunks/intervals.

Parameters:
  • n (int) – number to divide into chunks/intervals

  • num_chunks (int) – number of chunks/intervals to create

Returns:

list of start and end points to cover entire range

Return type:

list of int tuples
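
For example, splitting a range of 10 into 3 chunks might look as follows; the exact boundary convention (inclusive vs. exclusive endpoints) is an assumption here:

    from namematch.utils.utils import get_endpoints

    get_endpoints(10, 3)   # e.g. [(0, 4), (4, 7), (7, 10)] -- covers the full range with no gaps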

namematch.utils.utils.load_sample(csv_path, pct, cols=None)[source]

Load a random sample of a csv into pandas.

Parameters:
  • csv_path (str) – path to csv file

  • pct (float) – what percent of the file to randomly read

  • cols (list) – columns to load

Returns:

random subset of the input csv

Return type:

pd.DataFrame

namematch.utils.utils.load_csv_list(df_file_list, cols=None, conditions_dict={}, sample=1)[source]

Read a list of .csv files into a single pd.DataFrame.

Parameters:
  • df_file_list (list of str) – list of .csv files to read

  • cols (list) – columns to keep in the dataframe

  • conditions_dict (dict) – conditions for row filtering

  • sample (float) – share of rows to randomly sample from the final dataframe

Returns:

filtered sampled dataframe read in from the .csv files

Return type:

pd.DataFrame

namematch.utils.utils.load_parquet(df_file, cols=None, conditions_dict={})[source]

Read a .parquet file into a pd.DataFrame.

Parameters:
  • df_file (str) – .parquet file to read

  • cols (list) – columns to keep in the dataframe

  • conditions_dict (dict) – conditions for row filtering

Returns:

filtered dataframe read in from the .parquet file

Return type:

pd.DataFrame

namematch.utils.utils.load_parquet_list(df_file_list, cols=None, conditions_dict={}, sample=1)[source]

Read a list of .parquet files into a single pd.DataFrame.

Parameters:
  • df_file_list (list of str) – list of .parquet files to read

  • cols (list) – columns to keep in the dataframe

  • conditions_dict (dict) – conditions for row filtering

  • sample (float) – share of rows to randomly sample from the final dataframe

Returns:

filtered sampled dataframe read in from the .parquet files

Return type:

pd.DataFrame
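
A hypothetical call, assuming conditions_dict maps column names to the value rows must have in order to be kept (the exact filtering semantics are not documented here):

    from namematch.utils.utils import load_parquet_list

    # made-up file list and filter: keep only rows where drop_from_nm == 0
    an_df = load_parquet_list(
        ['output/details/all_names.parquet'],
        cols=['record_id', 'dob'],
        conditions_dict={'drop_from_nm': 0},
        sample=1,
    )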

namematch.utils.utils.determine_model_to_use(dr_df, model_info, verbose=False)[source]

Assign a model to each data row based on which fields are available.

Parameters:
  • dr_df (pd.DataFrame) – data rows, with columns:

    record_id_1 – unique identifier for the first record in the pair
    record_id_2 – unique identifier for the second record in the pair
    <distance metric fields> – distance metrics between the two records’ matching fields
    label – flag, “1” if the records are a match, “0” if not, “” if unknown

  • model_info (dict) – information about models and their universes

  • verbose (bool) – flag controlling logging statements (set according to which function calls this one)

Returns:

string indicating which model to use for a given record pair

Return type:

pd.Series

namematch.utils.utils.load_models(model_info_file, selection=False)[source]

Load pre-trained models (selection and match, as available).

Parameters:
  • model_info_file (str) – path to original model config

  • selection (bool) – if True, try to load a corresponding selection model

Returns:

models (dict) – maps model name (e.g. basic or no-dob) to a fit model object
model_info (dict) – information about how to fit the model

Return type:

tuple of (dict, dict)

namematch.utils.utils.recursively_convert_tuple_to_list(value)[source]
namematch.utils.utils.luigi_dict_parameter_to_dict(d)[source]
namematch.utils.utils.filename_friendly_hash(inputs)[source]
namematch.utils.utils.load_logging_params(logging_params_file=None)[source]
namematch.utils.utils.reformat_dict(d: dict)[source]

Make all the string values in the yaml file use double quotes.

namematch.utils.utils.camel_to_snake(name)[source]
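
For example (expected behavior inferred from the function name):

    from namematch.utils.utils import camel_to_snake

    camel_to_snake('GenerateDataRows')   # expected to yield 'generate_data_rows'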

namematch.generate_report

class namematch.generate_report.IgnoreBlackWarning(name='')[source]

Bases: Filter

Initialize a filter.

Initialize with the name of the logger which, together with its children, will have its events allowed through the filter. If no name is specified, allow every event.

filter(record)[source]

Determine if the specified record is to be logged.

Returns True if the record should be logged, or False otherwise. If deemed appropriate, the record may be modified in-place.

class namematch.generate_report.GenerateReport(params, schema, report_file, *args, **kwargs)[source]

Bases: NamematchBase

Parameters:
  • params (Parameters object) – contains parameter values

  • schema (Schema object) – contains match schema info (files to match, variables to use, etc.)

  • report_file (str) – full path of the report html file

property output_files
main(**kw)[source]

Main method for the task class; called by namematcher.NameMatcher.