API Reference
namematch.namematcher
- class namematch.namematcher.NameMatcher(config: dict | None = None, default_params: dict | None = None, og_blocking_index_file: str = 'None', trained_model_info_file: str = 'None', nm_info_file: str = 'nm_info.yaml', log_file_name: str | None = None, logging_params_file: str | None = None, output_dir: str = 'output', output_temp_dir: str | None = None, all_names_file: str = 'all_names.parquet', must_links: str = 'must_links.parquet', blocking_index_bin_file: str = 'blocking_index.bin', candidate_pairs_file: str = 'candidate_pairs.parquet', data_rows_dir: str = 'data_rows', selection_model_name: str = 'basic_selection_model.pkl', match_model_name: str = 'basic_match_model.pkl', flipped0_file: str = 'flipped0_potential_links.csv', model_dir: str = 'model', model_info_file: str = 'model.yaml', potential_edges_dir: str = 'potential_links', cluster_assignments: str = 'cluster_assignments.pkl', edges_to_cluster: str = 'edges_to_cluster.parquet', constraints: str | Constraints | None = None, an_output_file: str = 'all_names_with_clusterid.csv', report_file: str = 'matching_report.html', enable_lprof: bool = False, logging_level: str = 'INFO', params=None, schema=None)[source]
Bases:
object
Main interface to run all the steps in namematch
- property process_input_data
- property generate_must_links
- property block
- property generate_data_rows
- property fit_model
- property predict
- property cluster
- property generate_output
- property generate_report
- property all_tasks
- property nm_metadata
Namematch state including all the necessary attributes to recreate the NameMatcher object
- run(force=False, write_params_schema_file=True, write_stats_file=True)[source]
Main method to kick off the namematch process
- Parameters:
force (bool) – force all tasks to run, even those that have already completed
write_params_schema_file (bool) – whether to write params and schema to yaml
write_stats_file (bool) – whether to write the nm_info file
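A minimal usage sketch is shown below; the config contents and output directory are illustrative placeholders, not a complete or required configuration.

```python
from namematch.namematcher import NameMatcher

# Placeholder config: in practice this dict holds the data_files and
# variables sections described in the configuration documentation.
config = {
    # "data_files": {...},
    # "variables": [...],
}

nm = NameMatcher(config=config, output_dir="nm_output")
nm.run()  # runs every task, from process_input_data through generate_report
```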
- classmethod load_namematcher(nm_info_file_path, new_nm_info_file=None, **kwargs)[source]
Load a NameMatcher instance from an existing nm_info_file. This classmethod lets a run pick up where it left off: it creates a NameMatcher instance and restores its attributes, as well as the stats_dict for tasks that have already run (see the sketch below).
- Parameters:
nm_info_file_path (str) – path to the nm_info.yaml file
- Returns:
NameMatcher instance
- Return type:
NameMatcher
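A hedged sketch of resuming a run; the yaml path assumes the default file name and that it was written inside the run's output directory.

```python
from namematch.namematcher import NameMatcher

# Recreate the NameMatcher from a previous run's state file and continue.
nm = NameMatcher.load_namematcher("output/nm_info.yaml")
nm.run()  # tasks recorded as complete are skipped unless force=True
```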
namematch.data_structures.parameters
- class namematch.data_structures.parameters.Parameters(validate_param_dict)[source]
Bases:
object
Class that houses important matching parameters. Handles validation of the config file.
- static check_integrity(defaults, param, param_value)[source]
Ensure that parameters are of the appropriate type.
- Parameters:
defaults (dict) – dictionary with default parameter values
param (str) – parameter name (key)
param_value – value of the given key (parameter name)
- classmethod init(config: dict, defaults: dict)[source]
Create a Parameters instance.
- Parameters:
config (dict) – dictionary with match parameter values
defaults (dict) – dictionary with default params
- Returns:
instance of the Parameters class
- Return type:
Parameters
- classmethod load(filepath)[source]
Load a Parameters instance.
- Parameters:
filepath (str) – path to a yaml version of a Parameters instance
- Returns:
instance of the Parameters class
- Return type:
Parameters
- check_for_required_variables(variables)[source]
Validate that the config includes required variables.
- validate_exactmatch_variables(variables)[source]
Validate the exact_match_variables and negate_exact_match_variables parameters.
- validate_blocking_scheme(variables)[source]
Validate that the blocking scheme is in the correct format and provides the minimum number of blocking variables per blocking type (cosine_distance, edit_distance, absvalue_distance).
- get_blocking_variables()[source]
Get list of blocking variable nicknames.
- Returns:
list of variable nicknames (all-names columns) to use for blocking
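For illustration, a sketch of building a Parameters object from yaml files; the file names here are assumptions, and the required config/defaults keys are described in the configuration documentation.

```python
import yaml
from namematch.data_structures.parameters import Parameters

with open("config.yaml") as f:        # user-supplied match config (assumed path)
    config = yaml.safe_load(f)
with open("defaults.yaml") as f:      # package default parameter values (assumed path)
    defaults = yaml.safe_load(f)

params = Parameters.init(config=config, defaults=defaults)
blocking_vars = params.get_blocking_variables()  # nicknames of the blocking columns
```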
namematch.data_structures.schema
- class namematch.data_structures.schema.Schema(data_files, variables)[source]
Bases:
object
Class that houses the most essential instructions for how to complete the match: what data files to match, and which variables to use to do so.
- classmethod init(config, params)[source]
Create and validate a DataFileList instance and a VariableList instance.
- Parameters:
config (dict) – dictionary with match parameter values
params (dict) – dictionary with processed match parameter values
- Returns:
instance of the Schema class
- Return type:
Schema
namematch.data_structures.data_file
- class namematch.data_structures.data_file.DataFile(validated_data_file_dict)[source]
Bases:
object
Parent class for NewDataFile and ExistingDataFile, which house details about the data files input for matching.
- classmethod load(data_file_dict)[source]
Load a DataFile instance (either a NewDataFile or an ExistingDataFile).
- Parameters:
data_file_dict (dict) – dictionary-version of a DataFile object
- Returns:
instance of the DataFile class
- class namematch.data_structures.data_file.NewDataFile(validated_data_file_dict)[source]
Bases:
DataFile
- class namematch.data_structures.data_file.ExistingDataFile(validated_data_file_dict)[source]
Bases:
DataFile
- class namematch.data_structures.data_file.DataFileList(data_files_dict)[source]
Bases:
object
Class that houses a list of DataFile objects (either NewDataFiles or ExistingDataFiles).
- classmethod build(data_files_dict, existing_data_files_dict)[source]
Create a DataFileList instance.
- Parameters:
data_files_dict (dict) – dictionary with “new data file” info from user-input config
existing_data_files_dict (dict) – dictionary with “existing data file” info from user-input config
- Returns:
instance of the DataFileList class
- Return type:
DataFileList
- classmethod load(data_files_list_dict)[source]
Load a DataFileList instance.
- Parameters:
data_files_list_dict (dict) – dictionary-version of a DataFileList object
- Returns:
instance of the DataFileList class
- Return type:
DataFileList
- get_all_nicknames()[source]
Return a list of all of the DataFile nicknames in the DataFileList.
- Returns:
list of strings
- validate()[source]
Validate the DataFileList by validating the list overall and then validating each individual DataFile.
- validate_names()[source]
Validate that the DataFiles in the DataFileList all have unique nicknames and that the output file stems have unique cluster types.
namematch.data_structures.variable
- class namematch.data_structures.variable.Variable(validated_variable_dict)[source]
Bases:
object
- classmethod build(variable_dict, params)[source]
Create a Variable instance.
- Parameters:
variable_dict (dict) – info about a variable definition from user-input config
params (dict) – dictionary with processed match parameter values
- Returns:
instance of the Variable class
- Return type:
Variable
- validate_col_parameters(data_files)[source]
Validate that each data file has a corresponding “_col” parameter in each variable definition.
- Parameters:
data_files (namematch.data_structures.data_file.DataFileList) – info about what input files are being matched
- get_columns_to_read(file_nickname)[source]
Get the name(s) of the column(s) from the input file that need to be read in order to create the current all-names column.
- Parameters:
file_nickname (str) – nickname of input file being searched
- Returns:
list of column names
- class namematch.data_structures.variable.VariableList(variable_list)[source]
Bases:
object
Class that houses a list of Variable objects.
- classmethod build(variable_dict_list, params)[source]
Create a VariableList instance.
- Parameters:
variable_dict_list (dict) – dictionary with variable info from user-input config
params (dict) – dictionary with processed match parameter values
- Returns:
instance of the VariableList class
- Return type:
VariableList
- classmethod load(variables_list_dict)[source]
Load a VariableList instance.
- Parameters:
variables_list_dict (dict) – dictionary version of a VariableList instance
- Returns:
instance of the VariableList class
- Return type:
VariableList
- validate_col_parameters(data_files)[source]
Validate that the “_col” variables referenced in the config’s variable definitions actually exist in the input datasets.
- Parameters:
data_files (namematch.data_structures.data_file.DataFileList) – info about what input files are being matched
- validate_variable_names()[source]
Validate that the Variables in the VariableList all have unique nicknames.
- validate_type_counts(incremental)[source]
Validate that there is exactly one variable with compare type UniqueID. If incremental, validate that there is exactly one variable with compare type ExistingID.
- Parameters:
incremental (bool) – True if the config file provides “existing” data files
- validate(data_files)[source]
Validate several components of the variables defined in the config.
- Parameters:
data_files (namematch.data_structures.data_file.DataFileList) – info about what input files are being matched
- get_variables_where(attr, attr_value, equality_type='equals', return_type='name')[source]
Select variables that meet a certain condition (e.g. compare_type == ‘Category’).
- Parameters:
attr (str) – variable feature to condition on
attr_value (str) – acceptable values for the variable feature
equality_type (str) – check conditions using either “equals” or “in”
return_type (str) – either “name” or “ix”, determining what return type to use
- Returns:
list of variable nicknames (all-names columns) or corresponding all-names column indices
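For example (a sketch assuming a built VariableList called `variables`; the compare-type values shown are illustrative):

```python
# Nicknames of all variables whose compare_type equals "Category"
category_cols = variables.get_variables_where(
    attr="compare_type", attr_value="Category")

# All-names column indices for variables whose compare_type is in a set of values
string_ixs = variables.get_variables_where(
    attr="compare_type", attr_value=["String", "Category"],
    equality_type="in", return_type="ix")
```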
- get_columns_to_read(data_file)[source]
Get the name(s) of the column(s) from the input file that need to be read in order to create the all-names file.
- Parameters:
data_file (DataFile object) – contains info about a given input file
- Returns:
list of column names
- get_an_column_names()[source]
Get the final list of all-names columns, including internally created columns like file_type and drop_from_nm.
- Returns:
list of all-names columns
namematch.process_input_data
- class namematch.process_input_data.ProcessInputData(params, schema, all_names_file='all_names.parquet', *args, **kwargs)[source]
Bases:
NamematchBase
- property output_files
- main(**kw)[source]
Follow the instructions in the schema and params objects to build the all-names file from the raw input file(s).
- Parameters:
params (Parameters object) – contains parameter values
schema (Schema object) – contains match schema info (files to match, variables to use, etc.)
all_names_file (str) – path to the all-names file
- process_geo_column(df, variable)[source]
Take dataframe of geographic data (either in “lat,lon” format or in “lat”, “lon” format) and ensure it has just one column.
- Parameters:
df (pd.DataFrame) – df of address input data (columns are strings)
variable (Variable object) – contains naming info for new geo column
- Returns:
DataFrame of clean geographic information for all_names file
- Return type:
pd.DataFrame
- parse_address(address)[source]
Parse an address string into distinct parts.
- Parameters:
address (str) – string of full address (e.g. 54 East 18th Rd.)
- Returns:
(address number, street name, street suffix)
- Return type:
tuple
- process_address_column(df, logger=None)[source]
Take dataframe of address data (either in “123 Main St.” format or “123”, “Main”, “St.” format, order matters) and parse as needed to produce three clean columns: street number, street name, and street type.
- Parameters:
df (pd.DataFrame) – df of address input data
- Returns:
DataFrame of clean address information for all_names file
- Return type:
pd.DataFrame
- process_check(s, variable)[source]
Check the validity of the values in a given all-names column (according to the data type and config instructions) and set the series name correctly.
- Parameters:
s (pd.Series) – series to process (will be an all-names column)
variable (Variable object) – contains info on how to validate data in series
- Returns:
Processed series
- Return type:
pd.Series
- process_data(df, variables, data_file, params)[source]
Read in part of an input file and process it according to the config in order to create part of the all-names file.
- Parameters:
df (pd.DataFrame) – chunk of an input data file
variables (VariableList object) – contains info about the fields for matching (from config)
data_file (DataFile object) – contains info about the input data set
params (dict) – dictionary of param values
- Returns:
a chunk of the all-names table (one row per input record)
record_id
unique record identifier
file_type
either “new” or “existing”
<fields for matching>
both for the matching model and for constraint checking
<raw name fields>
pre-cleaning version of first and last name
blockstring
concatenated version of blocking columns (sep by ::)
drop_from_nm
flag, 1 if met any “to drop” criteria 0 otherwise
- Return type:
pd.DataFrame
- namematch.process_input_data.process_set_missing(s, set_missing_list)[source]
Set values in a series to missing as needed.
- Parameters:
s (pd.Series) – strings to process
set_missing_list (list) – list of strings that are disallowed
- Returns:
Processed series
- Return type:
pd.Series
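The underlying idea is simply replacing disallowed values with a missing indicator; a standalone sketch of that logic (illustrative only, not the package's exact implementation, which may use a different missing representation):

```python
import pandas as pd

def set_missing_sketch(s: pd.Series, set_missing_list: list) -> pd.Series:
    # Keep allowed values; replace disallowed ones with the empty string.
    return s.where(~s.isin(set_missing_list), "")

names = pd.Series(["JOHN", "UNKNOWN", "REFUSED", "MARY"])
cleaned = set_missing_sketch(names, ["UNKNOWN", "REFUSED"])  # JOHN, "", "", MARY
```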
- namematch.process_input_data.process_drop(s, drop_list)[source]
Get the records in a series that have invalid values.
- Parameters:
s (pd.Series) – series being processed
drop_list (list of str) – invalid values
- Returns:
Indices of records that are not valid
- Return type:
list
- namematch.process_input_data.process_auto_drops(an, existing_drop_list, drop_logic)[source]
Get the records in all-names that have invalid values due to a combination of multiple columns (based on logic in the private config).
- Parameters:
an (pd.DataFrame) – all-names chunk being processed
existing_drop_list (list of str) – records already known to be invalid
drop_logic (list of dicts) – logic for what makes a record invalid
- Returns:
Indices of records that are not valid
- Return type:
list
namematch.generate_must_links
- class namematch.generate_must_links.GenerateMustLinks(params, schema, all_names_file, must_links, *args, **kwargs)[source]
Bases:
NamematchBase
- property output_files
- main(**kw)[source]
Generate the list of must-link pairs using UniqueID info.
- Parameters:
params (Parameters object) – contains parameter values
schema (Schema object) – contains match schema info (files to match, variables to use, etc.)
all_names_file (str) – path to the all-names file
must_links (str) – path to the must-links file
- build_ml_var_df(all_names_file, uid_vars_list, **kw)[source]
Load the all-names file and limit it to the rows that have a non-missing UniqueID value.
- Parameters:
all_names_file (str) – path to the all-names file
uid_vars_list (list of strings) – all-name columns with compare_type “UniqueID”
- Returns:
a subset of the all-names file, relevant columns only
record_id
unique record identifier
blockstring
concatenation of all the blocking variables (sep by ::)
drop_from_nm
flag, 1 if met any “to drop” criteria 0 otherwise
new_record
either True or False
<UniqueID column(s)>
variables of compare_type UniqueID
- Return type:
pd.DataFrame
- get_must_links(ml_var_df, uid_vars_list, **kw)[source]
Expand the list of records with must-link information to pairs of records that must be linked together in the final match.
- Parameters:
ml_var_df (pd.DataFrame) –
record_id
unique record identifier
blockstring
concatenation of all the blocking variables (sep by ::)
drop_from_nm
flag, 1 if met any “to drop” criteria 0 otherwise
new_record
either True or False
<UniqueID column(s)>
variables of compare_type UniqueID
uid_vars_list (list of strings) – all-name columns with compare_type “UniqueID”
- Returns:
list of must-link record pairs
record_id_1
unique identifier for the first record in the pair
record_id_2
unique identifier for the second record in the pair
blockstring_1
blockstring for the first record in the pair
blockstring_2
blockstring for the second record in the pair
drop_from_nm_1
flag, True if the first record in the pair was not eligible for matching
drop_from_nm_2
flag, True if the second record in the pair was not eligible for matching
- Return type:
pd.DataFrame
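Conceptually, every group of records sharing a non-missing UniqueID value is expanded into all pairwise combinations; a simplified sketch of that expansion (column handling and edge cases omitted):

```python
from itertools import combinations
import pandas as pd

def expand_to_pairs_sketch(ml_var_df: pd.DataFrame, uid_col: str) -> pd.DataFrame:
    # For each UniqueID value, emit every pair of record_ids that share it.
    pairs = []
    for _, grp in ml_var_df.dropna(subset=[uid_col]).groupby(uid_col):
        for r1, r2 in combinations(grp["record_id"], 2):
            pairs.append({"record_id_1": r1, "record_id_2": r2})
    return pd.DataFrame(pairs, columns=["record_id_1", "record_id_2"])
```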
namematch.block
- class namematch.block.Block(params, schema, all_names_file='all_names.parquet', must_links_file='must_links.parquet', candidate_pairs_file='candidate_pairs.parquet', blocking_index_bin_file='blocking_index.bin', og_blocking_index_file='None', *args, **kwargs)[source]
Bases:
NamematchBase
- Parameters:
params (Parameters object) – contains matching parameter values
schema (Schema object) – contains match schema info (files to match, variables to use, etc.)
all_names_file (str) – path to the all-names file
must_links_file (str) – path to the must-links file
blocking_index_bin_file (str) – name of the blocking index file
og_blocking_index_file (str) – path to a pre-built nmslib index (optional, if doesn’t exist then None)
candidate_pairs_file (str) – path to the candidate-pairs file
- property output_files
- main(**kw)[source]
Generate the candidate-pairs list using the blocking scheme outlined in the config.
- split_last_names(df, last_name_column, blocking_scheme, **kw)[source]
Expand the processed all-names file to handle double last names (e.g. SAM SMITH-BROWN becomes SAM SMITH and SAM BROWN).
- Parameters:
df (pd.DataFrame) – all-names table, relevant columns only (where drop_from_nm == 0)
last_name_column (str) – clean last name column
blocking_scheme (dict) – dictionary with info on how to do blocking
- Returns:
more rows than input all names, plus orig_last_name and orig_record columns
- Return type:
pd.DataFrame
- convert_all_names_to_blockstring_info(an, absval_col, params, **kw)[source]
Create a table with information about blockstrings. If the split_names parameter is True, then this function expands double last names to create two new “records” (e.g. SAM SMITH-BROWN becomes SAM SMITH and SAM BROWN).
- Parameters:
an (pd.DataFrame) –
all-names table, relevant columns only (where drop_from_nm == 0)
record_id
unique record identifier
blockstring
concatenation of all the blocking variables (sep by ::)
file_type
either “new” or “existing”
drop_from_nm
flag, 1 if met any “to drop” criteria 0 otherwise
<nn-blocking column(s)>
variables for near-neighbor blocking
<ed-blocking column>
variable for edit-distance blocking
<av-blocking column>
(optional) variable for abs-value blocking
nn_string
concatenated version of nn-blocking columns (sep by ::)
ed_string
copy of ed-blocking column
absval_string
copy of abs-value-blocking column
absval_col (str) – column for absolute-value blocking
params (Parameter object) – contains matching parameters
- Returns:
tuple containing:
nn_string_info (pd.DataFrame): table with one row per nn_string (or expanded nn_string)
nn_string
concatenated version of nn-blocking columns (sep by ::)
commonness_penalty
float indicating how common the last name is
n_new
number of times this nn_string appears in a “new” record
n_existing
number of times this nn_string appears in an “existing” record
n_total
number of times this nn_string appears in any record
nn_string_expanded_df (pd.DataFrame): table with one row per blockstring (or expanded blockstring)
nn_string
concatenated version of nn-blocking columns (sep by ::)
nn_string_full
(optional) if split_names is True, this is the full (un-split) nn_string
ed_string
copy of ed-blocking column
absval_string
copy of abs-value-blocking column
- Return type:
tuple
- get_query_strings(nn_string_info, blocking_scheme)[source]
Filter to nn_strings that appear in the new data – these are the only strings for which we need near-neighbors. If incremental is False, this filtering step does nothing.
- Parameters:
nn_string_info (pd.DataFrame) –
table with one row per nn_string (or expanded nn_string)
nn_string
concatenated version of nn-blocking columns (sep by ::)
commonness_penalty
float indicating how common the last name is
n_new
number of times this nn_string appears in a “new” record
n_existing
number of times this nn_string appears in an “existing” record
n_total
number of times this nn_string appears in any record
blocking_scheme (dict) – dictionary with info on how to do blocking
- Returns:
tuple containing:
nn_string_info_to_query (pd.DataFrame): nn_string_info, subset to nn_strings where n_new > 0
nn_strings_to_query (list): nn_strings that appear at least once in a “new” record
shingles_to_query (scipy.sparse.csr_matrix): sparse weighted shingles matrix for the nn_strings that appear in a new record
- Return type:
tuple
- generate_shingles_matrix(nn_strings, alpha, power, matrix_type, verbose=True, **kw)[source]
Return a weighted sparse matrix of 2-shingles.
- Parameters:
nn_strings (list) – strings of the form ‘FIRST::LAST’ to shingle and put in the matrix (rows)
alpha (float) – weight of LAST relative to FIRST
power (float) – parameter controlling the impact of name length on cosine distance
matrix_type (str) – description of matrix being built (for logging)
verbose (bool) – True if status messages desired
- Returns:
Weighted sparse 2-shingles matrix
- Return type:
scipy.sparse.csr_matrix
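As a simplified illustration of the 2-shingle representation, here is a sketch using scikit-learn's character n-gram vectorizer; the alpha (last-name weight) and power (name-length) adjustments applied by the real matrix are omitted:

```python
from sklearn.feature_extraction.text import CountVectorizer

nn_strings = ["JOHN::SMITH", "JON::SMYTH", "MARY::HANDA"]

# Character 2-shingles over the concatenated first/last name strings.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
shingles = vectorizer.fit_transform(nn_strings)  # scipy.sparse CSR matrix, one row per name
```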
- load_main_index(index_file, **kw)[source]
Load the main index, which is reusable over time as data is added incrementally.
- Parameters:
index_file (str) – path to stored index
- Returns:
nmslib index object
- Return type:
nmslib.FloatIndex
- generate_index(nn_strings, num_workers, M, efC, post, alpha, power, print_progress=True, **kw)[source]
Build an nmslib index based on a list of nn_strings and a set of parameters.
- Parameters:
nn_strings (list) – strings of the form ‘FIRST::LAST’ to shingle and put in matrix (rows)
num_workers (int) – number of threads nmslib should use when parallelizing
M – nmslib index parameter
efC – nmslib index parameter
post – nmslib index parameter
alpha (float) – weight of last-name relative to first-name
power (float) – parameter controlling the impact of name length on cosine distance
print_progress (bool) – controls verbosity of index creation
- Returns:
nmslib index object
- Return type:
nmslib.FloatIndex
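A rough sketch of the nmslib workflow this wraps, assuming a sparse cosine space over the shingles matrix from the previous sketch; the parameter values are illustrative and in practice come from the config:

```python
import nmslib

index = nmslib.init(method="hnsw", space="cosinesimil_sparse",
                    data_type=nmslib.DataType.SPARSE_VECTOR)
index.addDataPointBatch(shingles)  # CSR matrix of weighted 2-shingles
index.createIndex({"M": 100, "efConstruction": 1000, "post": 0}, print_progress=True)

# Query the k nearest neighbors for a batch of query rows (also a CSR matrix).
neighbors = index.knnQueryBatch(shingles, k=25, num_threads=4)
```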
- get_indices(params, all_nn_strings, og_blocking_index_file, **kw)[source]
Wrapper function coordinating the creation and/or loading of the nmslib indices.
- Parameters:
params (Parameters object) – contains matching parameter values
all_nn_strings – list of all unique nn_strings in the data (expanded if split_names is True)
og_blocking_index_file (str) – path to a pre-built nmslib index (optional, if doesn’t exist then None)
- Returns:
tuple containing:
main_index (nmslib.FloatIndex): the main nmslib index
main_index_nn_strings (list): nn_strings that are in the main nmslib index
second_index (nmslib.FloatIndex): the secondary nmslib index for querying new nn_strings during incremental runs (often None)
second_index_nn_strings (list): nn_strings that are in the secondary nmslib index (often None)
- Return type:
tuple
- generate_candidate_pairs(nn_strings_to_query, shingles_to_query, nn_string_info, nn_string_expanded_df, main_index, main_index_nn_strings, second_index, second_index_nn_strings, batch_size, **kw)[source]
Wrapper function for querying the nmslib index (or indices) and getting non-matching candidate pairs.
- Parameters:
nn_strings_to_query (list) – nn_strings in new data – those that need near neighbors
shingles_to_query (csr_matrix) – shingles matrix for nn_strings_to_query
nn_string_info (pd.DataFrame) – table with one row per nn_string (or expanded nn_string)
nn_string_expanded_df (pd.DataFrame) – maps a nn_string to a ed_string and absval_string
main_index (nmslib index) – the main nmslib index for querying
main_index_nn_strings (list) – nn_strings in main_index
second_index (nmslib index) – the secondary nmslib index, for some incremental runs
second_index_nn_strings (list) – nn_strings in second_index
batch_size (int) – batch size; defaults to 10000 and can be modified in the config.yaml file
- Returns:
candidate-pairs list, before adding in uncovered pairs
blockstring_1
concatenated version of blocking columns for first element in pair (sep by ::)
blockstring_2
concatenated version of blocking columns for second element in pair (sep by ::)
cos_dist
approximate cosine distance between two nn_strings (nmslib)
edit_dist
number of character edits between ed-strings
covered_pair
flag; 1 for pairs that made it through blocking, 0 otherwise; all 1s here
- Return type:
pd.DataFrame
- compute_cosine_sim(blockstrings_in_pairs, pairs_df, shingles_matrix, **kw)[source]
Fast cosine similarity computation using the shingles matrix.
- Parameters:
blockstrings_in_pairs (list) – used to get index of different strings in shingles_matrix
pairs_df (pd.DataFrame) –
blockstrings you want cosine distance between
blockstring_1
blockstring for the first record in the pair
blockstring_2
blockstring for the second record in the pair
covered_pair
flag, 1 if covered 0 otherwise
nn_strings_1
nn_string for the first record in the pair
nn_strings_2
nn_string for the second record in the pair
both_nn_strings
nn_string_1 + ‘ ‘ + nn_string_2
shingles_matrix (csr_matrix) – weighted shingles matrix
- Returns:
cosine distance between the two strings in each pair
- evaluate_blocking(cp_df, tp_df, blocking_scheme, **kw)[source]
The evaluate_blocking function computes the pair completeness metrics to determine how successful blocking was at minimizing comparisons and maximizing true positives (i.e. generating a candidate pair between records that are actually matches).
- Parameters:
cp_df (pd.DataFrame) – candidate pairs df
tp_df (pd.DataFrame) – true pairs df (blockstring_1, blockstring_2)
blocking_scheme (dict) – dictionary with info on how to do blocking
- Returns:
portion of candidate-pairs dataframe where covered == 0
- Return type:
pd.DataFrame
- add_uncovered_pairs(candidate_pairs_df, uncovered_pairs_df)[source]
Add the uncovered pairs to the candidate pairs dataframe so that all of the known pairs are in the candidate pairs list.
- Parameters:
candidate_pairs_df (pd.DataFrame) – candidate pairs file produced by blocking
uncovered_pairs_df (pd.DataFrame) – uncovered pairs produced by evaluating blocking
- Returns:
candidate-pairs file
blockstring_1
concatenated version of blocking columns for first element in pair (sep by ::)
blockstring_2
concatenated version of blocking columns for second element in pair (sep by ::)
cos_dist
approximate cosine distance between two nn_strings (nmslib)
edit_dist
number of character edits between ed-strings
covered_pair
flag; 1 for pairs that made it through blocking, 0 otherwise
- Return type:
pd.DataFrame
- apply_blocking_filter(df, thresholds, nn_string_expanded_df, nns_match=False)[source]
Compare similarity of names and DOBs to see if a pair of records are likely to be a match.
- Parameters:
df (pd.DataFrame) –
holds similarity and commonness info about pairs of names
nn_string_1
concatenated version of nn-blocking columns for first element in pair (sep by ::)
nn_string_2
concatenated version of nn-blocking columns for second element in pair (sep by ::)
cos_dist
approximate cosine distance between two nn_strings (nmslib)
commonness_penalty_1
penalty for last-name commonness for first element in pair
commonness_penalty_2
penalty for last-name commonness for second element in pair
thresholds (dict) – information about what blocking distances are allowed
nn_string_expanded_df (pd.DataFrame) – maps a nn_string to a ed_string and absval_string
nns_match (bool) – True if this function is called by get_exact_match_candidate_pairs
- Returns:
chunk of the candidate-pairs list
blockstring_1
concatenated version of blocking columns for first element in pair (sep by ::)
blockstring_2
concatenated version of blocking columns for second element in pair (sep by ::)
cos_dist
approximate cosine distance between two nn_strings (nmslib)
edit_dist
number of character edits between ed-strings
covered_pair
flag; 1 for pairs that made it through blocking, 0 otherwise; all 1s here
- Return type:
pd.DataFrame
- disallow_switched_pairs(df, incremental, nn_strings_to_query)[source]
Look through the columns nn_string_1 and nn_string_2 and keep only rows where nn_string_1 <= nn_string_2 to prevent duplicates in the end (i.e. ABBY->ZABBY & ZABBY->ABBY; only one is needed). Special case for incremental runs.
- Parameters:
df (pd.DataFrame) – holds similarity and commonness info about pairs of names
incremental (bool) – True if the current run is incremental
nn_strings_to_query (list) – nn_strings that are in “to query” list
- Returns:
same as input df, but no AB/BA duplicates
- Return type:
pd.DataFrame
- get_actual_candidates(near_neighbors_df, nn_string_expanded_df, nn_strings_to_query, thresholds, incremental, output=None)[source]
Actually determines whether two names become candidates; this function is launched by generate_candidate_pairs() and run on individual worker threads to speed up processing.
- Parameters:
near_neighbors_df (pd.DataFrame) –
holds similarity and commonness info about pairs of names
nn_string_ix
a string with nn_string_ix = i is the string located at nn_strings_queried_this_batch[i]
nn_string_1
concatenated version of nn-blocking columns for first element in pair (sep by ::)
nn_string_2
concatenated version of nn-blocking columns for second element in pair (sep by ::)
cos_dist
approximate cosine distance between two nn_strings (nmslib)
commonness_penalty_1
penalty for last-name commonness for first element in pair
commonness_penalty_2
penalty for last-name commonness for second element in pair
nn_string_expanded_df (pandas dataframe) – table at nn_string/ed_string/absval_string level (expanded if split_name is True)
nn_strings_to_query (list) – nn_strings in the “to query” list (needed for incremental check)
thresholds (dict) – information about what blocking distances are allowed
incremental (bool) – True if the current run is incremental
output – None if the output should be returned, rather than written
- Returns:
chunk of the candidate-pairs list
blockstring_1
concatenated version of blocking columns for first element in pair (sep by ::)
blockstring_2
concatenated version of blocking columns for second element in pair (sep by ::)
cos_dist
approximate cosine distance between two nn_strings (nmslib)
edit_dist
number of character edits between ed-strings
covered_pair
flag; 1 for pairs that made it through blocking, 0 otherwise; all 1s here
- Return type:
pd.DataFrame
- get_near_neighbors_df(near_neighbors_list, nn_string_info, nn_strings_this_index, nn_strings_queried_this_batch)[source]
For a small batch of names (nn_strings_queried_this_batch), format a dataframe that enumerates every pair of (name in this batch, a near neighbor), along with information about similarity and commonness.
- Parameters:
near_neighbors_list (list) – list of (list of k IDs, list of k distances) tuples, of length batch_size
nn_string_info (pd.DataFrame) – table mapping nn_string to commonness_penalty
nn_strings_this_index (list) – nn_strings in the current index
nn_strings_queried_this_batch (list) – nn_strings in the current query batch (length batch_size), whose neighbors are stored in near_neighbors_list
- Returns:
holds similarity and commonness info about pairs of names
nn_string_ix
a string with nn_string_ix = i is the string located at nn_strings_queried_this_batch[i]
nn_string_1
concatenated version of nn-blocking columns for first element in pair (sep by ::)
nn_string_2
concatenated version of nn-blocking columns for second element in pair (sep by ::)
cos_dist
approximate cosine distance between two nn_strings (nmslib)
commonness_penalty_1
penalty for last-name commonness for first element in pair
commonness_penalty_2
penalty for last-name commonness for second element in pair
- Return type:
pd.DataFrame
- get_exact_match_candidate_pairs(nn_string_info_multi, nn_string_expanded_df, blocking_thresholds)[source]
All nn_strings that appear more than once need to have a corresponding nn_string, nn_string candidate pair – we can skip the “approximation” easily for this type of candidate pair.
- Parameters:
nn_string_info_multi (pd.DataFrame) – nn_string_info, subset to nn_strings with n_new > 0 & n_total > 1
nn_string_expanded_df (pd.DataFrame) – table at nn_string/ed_string/absval_string level (expanded if split_name is True)
blocking_thresholds (dict) – dictionary with thresholds for blocking, e.g. high and low bar
- Returns:
portion of the candidate pairs list (where nn_string_1 == nn_string_2)
nn_string
concatenated version of nn-blocking columns (sep by ::)
commonness_penalty
float indicating how common the last name is
n_new
number of times this nn_string appears in a “new” record
n_existing
number of times this nn_string appears in an “existing” record
n_total
number of times this nn_string appears in any record
- Return type:
pd.DataFrame
- namematch.block.get_blocking_columns(blocking_scheme)[source]
Get the list of blocking variables for each type of blocking.
- Parameters:
blocking_scheme (dict) – dictionary with info on how to do blocking
- Returns:
the variable names needed for each type of blocking
- Return type:
list of string lists
- namematch.block.read_an(an_file, nn_cols, ed_col, absval_col)[source]
Read in relevant columns for blocking from the all-names file.
- Parameters:
an_file (str) – path to the all-names file
nn_cols (list of strings) – variables for near neighbor blocking
ed_col (str) – variable for edit-distance blocking
absval_col (str) – variable for absolute-value blocking
- Returns:
all-names dataframe, relevant columns only (where drop_from_nm == 0)
record_id
unique record identifier
blockstring
concatenation of all the blocking variables (sep by ::)
file_type
either “new” or “existing”
drop_from_nm
flag, 1 if met any “to drop” criteria 0 otherwise
<nn-blocking column(s)>
variables for near-neighbor blocking
<ed-blocking column>
variable for edit-distance blocking
<av-blocking column>
(optional) variable for abs-value blocking
nn_string
concatenated version of nn-blocking columns (sep by ::)
ed_string
copy of ed-blocking column
absval_string
copy of abs-value-blocking column
- Return type:
pd.DataFrame
- namematch.block.get_nn_string_counts(an)[source]
Count the number of records per nn_string (per file_type).
- Parameters:
an (pd.DataFrame) –
all-names table, relevant columns only (where drop_from_nm == 0)
record_id
unique record identifier
blockstring
concatenation of all the blocking variables (sep by ::)
file_type
either “new” or “existing”
drop_from_nm
flag, 1 if met any “to drop” criteria 0 otherwise
<nn-blocking column(s)>
variables for near-neighbor blocking
<ed-blocking column>
variable for edit-distance blocking
<av-blocking column>
(optional) variable for abs-value blocking
nn_string
concatenated version of nn-blocking columns (sep by ::)
ed_string
copy of ed-blocking column
absval_string
copy of abs-value-blocking column
- Returns:
two keys (new and existing), mapping to a dictionary of nn_strings to n_records
- Return type:
dict
- namematch.block.get_common_name_penalties(clean_last_names, max_penalty, num_threshold_bins=1000)[source]
Create a dictionary mapping each last name to a “commonness penalty.” Two SMITHs are less likely to be the same person than two HANDAs, since SMITH is such a common name. This function quantifies that penalty for use in later blocking calculations. A more common name receives a higher number, topping out at max_penalty.
- Parameters:
clean_last_names (pd.Series) – clean (un-split) last name column (one row per record)
max_penalty (float) – the maximum penalty (for the most common names)
num_threshold_bins (int) – number of different categories of commonness to create
- Returns:
dictionary mapping name (str) to penalty (float)
- Return type:
dict
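A rough sketch of the frequency-based idea; the package's binning into num_threshold_bins categories is more involved than this rank-and-scale version:

```python
import pandas as pd

def commonness_penalty_sketch(clean_last_names: pd.Series, max_penalty: float) -> dict:
    # More frequent last names get penalties closer to max_penalty.
    freq = clean_last_names.value_counts()
    penalties = freq.rank(pct=True) * max_penalty
    return penalties.to_dict()

penalties = commonness_penalty_sketch(
    pd.Series(["SMITH", "SMITH", "SMITH", "HANDA"]), max_penalty=0.1)
```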
- namematch.block.get_all_shingles()[source]
Get all valid 2-shingles.
- Returns:
valid 2-shingles
- Return type:
list
- namematch.block.prep_index()[source]
Initialize index data structure, which will store similarity information about the names, and load processed shingles into it.
- Returns:
nmslib index object (pre time-consuming build call)
- Return type:
nmslib.FloatIndex
- namematch.block.get_second_index_nn_strings(all_nn_strings, main_nn_strings)[source]
Get nn_strings that haven’t already been stored in the main index.
- Parameters:
all_nn_strings (list) – list of all nn_strings in the data (expanded if split_names is True)
main_nn_strings (list) – list of nn_strings already in the main index
- Returns:
the nn_strings that are not in main_nn_strings
- Return type:
list
- namematch.block.save_main_index(main_index, main_index_nn_strings, main_index_file)[source]
Save the main nmslib index and pickle dump the associated nn_strings list.
- Parameters:
main_index (nmslib.FloatIndex) – the main, built nmslib index
main_index_nn_strings (list) – list of nn_strings in the main index
main_index_file (str) – path to store the main nmslib index
- namematch.block.load_main_index_nn_strings(og_blocking_index_file)[source]
Load the nn_strings that are in an existing nmslib index file.
- Parameters:
og_blocking_index_file (str) – path to original blocking index
- Returns:
loaded list of nn_strings in an existing nmslib index
- Return type:
list
- namematch.block.write_some_cps(cand_pairs, candidate_pairs_file)[source]
Write out a portion of the candidate-pairs to parquet.
- Parameters:
cand_pairs (pd.DataFrame) – chunk of the candidate-pairs file
candidate_pairs_file (str) – path to the candidate-pairs file
- namematch.block.generate_true_pairs(must_links_df)[source]
Reduce the must-link record pairs to must-link blockstring pairs.
- Parameters:
must_links_df (pd.DataFrame) –
list of must-link record pairs
record_id_1
unique identifier for the first record in the pair
record_id_2
unique identifier for the second record in the pair
blockstring_1
blockstring for the first record in the pair
blockstring_2
blockstring for the second record in the pair
drop_from_nm_1
flag, 1 if the first record in the pair was not eligible for matching
drop_from_nm_2
flag, 1 if the second record in the pair was not eligible for matching
existing
flag, 1 if the pair is must-link because of ExistingID
- Returns:
list of must-link blockstring pairs (where both records have drop_from_nm == 0)
blockstring_1
blockstring for the first record in the pair
blockstring_2
blockstring for the second record in the pair
- Return type:
pd.DataFrame
namematch.generate_data_rows
- class namematch.generate_data_rows.GenerateDataRows(params, schema, output_dir, all_names_file, candidate_pairs_file, *args, **kwargs)[source]
Bases:
NamematchBase
- property output_files
- main(**kw)[source]
Take candidate pairs and merge on the all-names records (twice) to get a dataset at the record pair level. Compute distance metrics between the records in the pair – these are the features for modeling.
- Parameters:
params (Parameters object) – contains parameter values
schema (Schema object) – contains match schema info (files to match, variables to use, etc.)
all_names_file (str) – path to the all-names file
candidate_pairs_file (str) – path to the candidate-pairs file
output_dir (str) – path to the data-rows dir
- generate_name_probabilities_object(an, fn_col=None, ln_col=None, **kw)[source]
The generate_name_probabilities function uses a list of names (from the all-names file) to create an object containing queryable probability information for each name.
- Parameters:
an (pd.DataFrame) – all-names, just the name columns
fn_col (str) – name of first name column
ln_col (str) – name of last name column
- Returns:
name probability object
- generate_actual_data_rows(params, schema, sbs_df, np_object, first_iter)[source]
Create modeling dataframe by comparing each variable (via numerous distance metrics).
- Parameters:
params (Parameters object) – contains matching parameters
schema (Schema object) – contains matching schema (data files and variables)
sbs_df (pd.DataFrame) –
side-by-side table (record pair level, with info from both an records)
record_id (_1, _2)
unique record identifier
blockstring (_1, _2)
concatenated version of blocking columns (sep by ::)
file_type (_1, _2)
either “new” or “existing”
candidate_pair_ix
index from candidate-pairs list
covered_pair
flag, 1 if blockstring pair passed blocking 0 otherwise
<fields for matching> (_1, _2)
both for the matching model and for constraint checking
np_object (nm_prob.NameProbability object) – contains information about name probabilities
- Returns:
chunk of the data-rows file
dr_id
unique record pair identifier (record_id_1__record_id_2)
record_id (_1, _2)
unique record identifiers
<distance metrics>
how similar the different matching fields are between records
label
”1” if the records refer to the same person, “0” if not, “” otherwise
- Return type:
pd.DataFrame
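To illustrate the kind of pairwise features produced, a standalone sketch that compares one field between the two records in each pair using a standard-library similarity ratio (the package computes a much richer set of distance metrics):

```python
from difflib import SequenceMatcher
import pandas as pd

def field_similarity_sketch(sbs_df: pd.DataFrame, field: str) -> pd.Series:
    # Similarity in [0, 1] between <field>_1 and <field>_2 for each candidate pair.
    return sbs_df.apply(
        lambda row: SequenceMatcher(None, row[f"{field}_1"], row[f"{field}_2"]).ratio(),
        axis=1)

pairs = pd.DataFrame({"first_name_1": ["JOHN"], "first_name_2": ["JON"]})
pairs["first_name_sim"] = field_similarity_sketch(pairs, "first_name")
```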
- generate_data_row_files(params, schema, an, cp_df, name_probs, start_ix_worker, end_ix_worker, dr_file, **kw)[source]
The generate_data_row_files function is run in parallel to generate the data needed for the random forest; it performs the merge between candidate pairs and all-names and calls the function that calculates distance metrics.
- Parameters:
params (Parameters object) – contains matching parameters
schema (Schema object) – contains matching schema (data files and variables)
an (pd.DataFrame) –
all-names table (one row per input record)
record_id
unique record identifier
file_type
either “new” or “existing”
<fields for matching>
both for the matching model and for constraint checking
<raw name fields>
pre-cleaning version of first and last name
blockstring
concatenated version of blocking columns (sep by ::)
drop_from_nm
flag, 1 if met any “to drop” criteria 0 otherwise
cp_df (pd.DataFrame) –
candidate-pairs list
blockstring_1
concatenated version of blocking columns for first element in pair (sep by ::)
blockstring_2
concatenated version of blocking columns for second element in pair (sep by ::)
covered_pair
flag; 1 for pairs that made it through blocking, 0 otherwise; all 1s here
name_probs (nm_prob.NameProbability object) – contains information about name probabilities
start_ix_worker (int) – starting index of the candidate-pairs chunk to read in this thread
end_ix_worker (int) – end index of the candidate-pairs chunk to read in this thread
dr_file (str) – path to data-rows file to write (one for each worker thread)
namematch.fit_model
- class namematch.fit_model.FitModel(params, all_names_file, data_rows_dir, model_info_file, output_dir, trained_model_info_file='None', selection_model_name='basic_selection_model.pkl', match_model_name='basic_match_model.pkl', flipped0_file=None, *args, **kwargs)[source]
Bases:
NamematchBase
- Parameters:
params (Parameters object) – contains parameter values
all_names_file (str) – path to the all-names file
data_rows_dir (str) – path to the data-rows dir
model_info_file (str) – path to the model info yaml file
output_dir (str) – path to the model dir
trained_model_info_file (str) – path to the model info yaml file of a previously trained model
selection_model_name (str) – selection model name
match_model_name (str) – match model name
flipped0_file (str) – flipped0 file path
- property output_files
- property dr_file_list
- main(**kw)[source]
Train and evaluate random forest model(s). Depending on the settings, this might involve training and evaluating multiple types of models (e.g. selection and match models) and/or models for different data-row types (e.g. basic and no-dob).
- fit_model(df, vars_to_exclude, outcome, weights=None, n_jobs=1, **kw)[source]
Fit random forest model.
- Parameters:
df (pd.DataFrame) – data rows, subset to training rows
vars_to_exclude (list) – variables to disallow from the model
outcome (string) – name of the column that we’re predicting
weights (list) – sample weights to use for training (can be None)
n_jobs (int) – number of jobs to run in parallel
- Returns:
tuple containing:
mod (sklearn.ensemble.RandomForestClassifier): trained sklearn random forest model object
feature_info (pd.DataFrame): feature_importance
- Return type:
tuple
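At its core this is a scikit-learn random forest fit on the labeled, training-eligible data rows; a stripped-down sketch follows (column names and hyperparameters are placeholders, and the package wraps the estimator in a preprocessing pipeline):

```python
from sklearn.ensemble import RandomForestClassifier

# `train_df` is assumed to be a data-rows DataFrame limited to training rows.
feature_cols = [c for c in train_df.columns
                if c not in ("dr_id", "record_id_1", "record_id_2", "label")]

rf = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=1)
rf.fit(train_df[feature_cols], train_df["label"], sample_weight=None)
```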
- fit_models(train_df, model_type, model_info)[source]
Fit random forest models.
- Parameters:
train_df (pd.DataFrame) – data rows, subset to training rows
model_type (string) – either “selection” or “match”
model_info (dict) – dict with information about how to fit the model
- Returns:
maps model name (e.g. basic or no_dob) to a trained model object
- Return type:
dict
- evaluate_models(phats_df, outcome, model_type, weight_using_selection_model=False, default_threshold=0.5, missingness_model_threshold_boost=0.2, optimize_threshold=False, fscore_beta=1.0, **kw)[source]
- get_train_eval_data(an_train_eligible_dict, model_info, params, model_type, any_train=True)[source]
Load data-rows, filter to rows that are eligible for training a given model type, and then split the data into a training set and a labeled evaluation set.
- Parameters:
an_train_eligible_dict (dict) – maps record_id to flag indicating record’s all-names based training eligibility
model_info (dict) – dict with information about how to fit the model
params (Parameters object) – contains parameter values
model_type (str) – either “selection” or “match”
any_train (bool) – True if you want training data (e.g. not a pre-trained model), False otherwise
- Returns:
tuple containing:
pd.DataFrame: data rows, filtered to training data (excluding labeled eval data)
pd.DataFrame: data rows, filtered to labeled eval data
float: share of data rows that are labeled
- Return type:
tuple
- find_valid_training_records(an, an_match_criteria)[source]
Identify records that meet the all-names criteria for training data.
- Parameters:
an (pd.DataFrame) –
all-names table (one row per input record)
record_id
unique record identifier
file_type
either “new” or “existing”
<fields for matching>
both for the matching model and for constraint checking
<raw name fields>
pre-cleaning version of first and last name
blockstring
concatenated version of blocking columns (sep by ::)
drop_from_nm
flag, 1 if met any “to drop” criteria 0 otherwise
an_match_criteria (dict) – keys are all-names columns, mapped to acceptable values
- Returns:
flag, 1 if the record is eligible for training set 0 otherwise
- Return type:
pd.Series
- namematch.fit_model.get_feature_info(pipeline, raw_num_cols, raw_cat_cols)[source]
Extract the feature importance information from a sklearn model pipeline.
- Parameters:
pipeline (sklearn fitted pipeline) – trained model
raw_num_cols (list) – numeric columns that went into the model (before pipeline processing)
raw_cat_cols (list) – categorical columns that went into the model (before pipeline processing)
- Returns:
feature importance information
feature
name of the feature
importance
relative importance of this feature to the model
- Return type:
pd.DataFrame
- namematch.fit_model.save_models(selection_models, match_models, model_info)[source]
Save the models to file.
- Parameters:
selection_models (dict) – maps model name (e.g. basic or no-dob) to a fit selection model object
match_models (dict) – maps model name (e.g. basic or no-dob) to a fit match model object
model_info (dict) – dict with information about how to fit the model
- namematch.fit_model.define_necessary_models(dr_file_list, output_dir, missing_mod_field=None, selection_model_name='basic_selection_model.pkl', match_model_name='basic_match_model.pkl')[source]
Determine the different models needed (using a sample) and define the characteristics of data that determine which model should handle it.
- NOTE: Right now, there is an assumption that the training universe
is the same between all models (i.e. basic and missingness)
- Parameters:
dr_file_list (list) – list of paths to all data row files
output_dir (str) – model output folder path
missing_mod_field (str or None) – field that could trigger need for separate model
- Returns:
mapping the name of a model (str) to a dict of the following information:
selection_model_name (str)
match_model_name (str)
type (str): one of “default” or “missingness”
actual_phat_universe (dict): maps a variable name to a value(?)
vars_to_exclude (str list)
match_thresh (float): threshold for match/nonmatch
- Return type:
dict
- namematch.fit_model.load_and_save_trained_model(trained_model_info_file, output_file)[source]
Load a set of pre-trained models and copy them to the current run’s output directory. Typically only used in incremental runs.
- Parameters:
trained_model_info_file (str) – path to a model yaml file, which has path/threshold/universe info
output_file (str) – path to output the current run’s model yaml file (for copying)
- Returns:
maps model name (e.g. basic or no_dob) to a trained model object
- Return type:
dict
- namematch.fit_model.get_match_train_eligible_flag(df, dr_train_eligible_conditions_dict, an_train_eligible_dict)[source]
Determine if a data-row is eligible for training (for match models), according to both all-names eligibility criteria and data-row eligibility criteria.
- Parameters:
df (pd.DataFrame) – portion of data-rows file, limited to labeled rows
dr_train_eligible_conditions_dict (dict) – contains data-row training eligibility criteria
an_train_eligible_dict (dict) – maps record_id to flag indicating record’s all-names based training eligibility
- Returns:
flag, 1 if data-row is training eligible (for match models)
- Return type:
pd.Series
- namematch.fit_model.add_threshold_dict(model_info, thresholds_dict)[source]
Add threshold information to the model_info dict, once it’s been determined.
- Parameters:
model_info (dict) – dict with information about how to fit the model
thresholds_dict (dict) – keys are model name (e.g. basic, no-dob), values are optimized thresholds
- Returns:
model dict, now with threshold info
- Return type:
dict
- namematch.fit_model.get_flipped0_potential_edges(phats_df, model_info, allow_clusters_w_multiple_unique_ids)[source]
If allowed, identify the set of labeled 0s with high phats so they can be treated as matches downstream.
- Parameters:
phats_df (pd.DataFrame) –
phat info for record pairs
record_id (_1, _2)
unique record identifiers
model_to_use
based on pair characteristics, which model to use (e.g. basic or no-dob)
covered_pair
did the pair make it through blocking
match_train_eligible
is the pair eligible for training (for match model)
exactmatch
is the pair an exact match on name/dob
label
whether the pair is a match or not
<phat_col>
predicted probability of match
model_info (dict) – dict with information about how to fit the model
allow_clusters_w_multiple_unique_ids (bool) – param controlling if 0s can be flipped to 1
- Returns:
same as phats_df, limited to the labeled-0 pairs whose predicted probabilities are high enough to flip
- Return type:
pd.DataFrame
namematch.predict
- class namematch.predict.Predict(params, data_rows_dir, model_info_file, output_dir, *args, **kwargs)[source]
Bases:
NamematchBase
- Parameters:
params (Parameters object) – contains parameter values
model_info_file (str) – path to the model info yaml file for a trained model
data_rows_dir (str) – path to the data-rows dir
output_dir (str) – path to the potential-links dir
- property output_files
- property dr_file_list
- main(**kw)[source]
Read in data-rows and predict (in parallel) for each unlabeled pair. Output the pairs above the threshold.
- get_potential_edges(dr_file, match_models, model_info, output_dir, params, **kw)[source]
Read in data rows in chunks and predict as needed. Write (append) the edges above the threshold to the appropriate file.
- Parameters:
dr_file (string) – path to data file to predict for
match_models (dict) – maps model name (e.g. basic or no-dob) to a fit match model object
model_info (dict) – contains information about threshold
output_dir (str) – directory to place potential links
params (Parameters obj) – contains parameter values (i.e. use_uncovered_phats)
- get_potential_edges_in_parallel(match_models, model_info, output_dir, params)[source]
Dispatch the worker threads that will predict for unlabeled pairs in parallel.
- Parameters:
match_models (dict) – maps model name (e.g. basic or no-dob) to a fit match model object
model_info (dict) – dict with information about how to fit the model
output_dir (str) – directory to place potential links
params (Parameters object) – contains parameter values
- classmethod predict(models, df, model_type, oob=False, all_cols=False, all_models=True, prob_match_train=None)[source]
Use the trained models to predict for pairs of records.
- Parameters:
models (dict) – maps model name (e.g. basic or no-dob) to a fit match model object
df (pd.DataFrame) – portion of the data-rows table, with a “model_to_use” column appended
model_type (str) – model type (e.g. selection or match)
oob (bool) – if True, use the out-of-bag predictions
all_cols (bool) – if True, keep all columns in the output df; not just the relevant ones
all_models (bool) – if True, predict for each row using all models, not just the “model to use”
prob_match_train (float) – share of data-rows that are labeled
namematch.cluster
- class namematch.cluster.Constraints[source]
Bases:
object
- property get_columns_used
- property is_valid_link
- property is_valid_cluster
- property apply_link_priority
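Below is a hedged sketch of a user-supplied constraints script passed via the `constraints` argument; the function names mirror the properties above, but the exact signatures, arguments, and return conventions shown are assumptions and should be checked against the constraints documentation.

```python
# constraints.py (illustrative; signatures are assumptions)
import pandas as pd

def get_columns_used():
    # Extra all-names columns needed for constraint checking (assumed format).
    return ["dob", "gender"]

def is_valid_link(record_pair):
    # Example rule: never link records with conflicting, non-missing gender.
    genders = record_pair["gender"].dropna().unique()
    return len(genders) <= 1

def is_valid_cluster(cluster):
    # Example rule: reject clusters spanning more than two distinct birth dates.
    return cluster["dob"].nunique() <= 2

def apply_link_priority(potential_links):
    # Try higher-probability edges first.
    return potential_links.sort_values("phat", ascending=False)
```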
- class namematch.cluster.Cluster(params, schema, must_links_file='must_links.parquet', potential_edges_dir='potential_links', flipped0_edges_file='flipped0_potential_links.csv', all_names_file='all_names.parquet', cluster_assignments='cluster_assignments.pkl', edges_to_cluster='edges_to_cluster.parquet', constraints: str | Constraints | None = None, *args, **kwargs)[source]
Bases:
NamematchBase
- Parameters:
params (Parameters object) – contains parameter values
schema (Schema object) – contains match schema info (files to match, variables to use, etc.)
constraints (str or Constraints object) – either a path to a python script defining constraint functions or a Constraints object
must_links_file (str) – path to the must-links file
potential_edges_dir (str) – path to the potential-links dir in the output/details folder
flipped0_edges_file (str) – path to the flipped-links file
all_names_file (str) – path to the all-names file
cluster_assignments (str) – path to the cluster-assignments file
- property output_files
- main(**kw)[source]
Read the record pairs with high probability of matching and connect them in a way that doesn’t violate any logic constraints to form clusters.
- auto_is_valid_edge(edges_df, uid_cols, allow_clusters_w_multiple_unique_ids, leven_thresh, eid_col=None)[source]
Check if two records would violate a unique id or existing id constraint.
- Parameters:
edges_df (pd.DataFrame) –
potential edges information
record_id_1
unique record identifier (for first in pair)
record_id_2
unique record identifier (for second in pair)
phat
predicted probability of a record pair being a match
original_order
original ordering 1-N (useful so gt is always on top of phat=1 cases)
uid_cols (list) – all-names column(s) with compare_type UniqueID
allow_clusters_w_multiple_unique_ids (bool) – True if a cluster can have multiple uid values
leven_thresh (int) – n character edits to allow between uids before they’re considered different
eid_col (str) – all-names column with compare_type ExistingID (None for non-incremental runs)
- Returns:
potential edges information, but limited to rows that pass the automated validity check
- Return type:
valid_edges_df (pd.DataFrame)
- auto_is_valid_cluster(cluster, uid_cols, allow_clusters_w_multiple_unique_ids, leven_thresh, eid_col=None)[source]
Check if a proposed cluster would violate a unique id or existing id constraint.
- Parameters:
cluster (pd.DataFrame) – all-names file (relevant columns only) records for the proposed cluster
uid_cols (list) – all-names column(s) with compare_type UniqueID
allow_clusters_w_multiple_unique_ids (bool) – True if a cluster can have multiple uid values
leven_thresh (int) – n character edits to allow between uids before they’re considered different
eid_col (str) – all-names column with compare_type ExistingID (None for non-incremental runs)
- Returns:
False if an automated constraint is violated
- Return type:
bool
- get_initial_clusters(must_links_df, an_df, eid_col, **kw)[source]
Use must links (ground truth and/or a previous run) to create the starting clusters.
- Parameters:
must_links_df (pd.DataFrame) –
record pairs that must be linked together no matter what
record_id_1
unique identifier for the first record in the pair
record_id_2
unique identifier for the second record in the pair
blockstring_1
blockstring for the first record in the pair
blockstring_2
blockstring for the second record in the pair
drop_from_nm_1
flag, 1 if the first record in the pair was not eligible for matching
drop_from_nm_2
flag, 1 if the second record in the pair was not eligible for matching
existing
flag, 1 if the pair is must-link because of ExistingID
an_df (pd.DataFrame) –
all-names file, with only the columns relevant for clustering
record_id
unique record identifier
<uid column(s)>
columns with compare_type UniqueID
<eid column(s)>
columns with compare_type ExistingID
<user-constraint column(s)>
(optional) columns mentioned in get_columns_used()
eid_col (str) – all-names column with compare_type ExistingID, or None
- Returns:
clusters (dict) – maps a cluster id to a list of record ids
cluster_assignments (dict) – maps a record_id to a cluster_id
original_cluster_ids (set) – cluster ids that are already in use (only for incremental runs)
- Return type:
tuple of (dict, dict, set)
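As a rough mental model (not the package's code), this step amounts to taking the connected components of the must-link graph. A minimal sketch, assuming networkx is available and ignoring the drop_from_nm flags and the ExistingID bookkeeping used in incremental runs:

```python
# Illustration only: starting clusters are essentially the connected
# components of the must-link graph. The real method also handles the
# drop_from_nm flags and ExistingID-based cluster ids for incremental runs.
import networkx as nx


def sketch_initial_clusters(must_links_df):
    g = nx.Graph()
    g.add_edges_from(
        must_links_df[["record_id_1", "record_id_2"]].itertuples(index=False))
    clusters, cluster_assignments = {}, {}
    for cluster_id, records in enumerate(nx.connected_components(g)):
        clusters[cluster_id] = sorted(records)
        for record_id in records:
            cluster_assignments[record_id] = cluster_id
    return clusters, cluster_assignments
```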
- get_potential_edges(potential_edges_files, flipped0_edges_file, gt_1s_df, cluster_logic, cluster_info, uid_cols, eid_col, **kw)[source]
Use the prediction (potential-links) files to build the list of edges that the constrained clustering algorithm should try to add.
- Parameters:
potential_edges_files (list) – paths to the potential links files
flipped0_edges_file (str) – path to the flipped0-links file
gt_1s_df (pd.DataFrame) – known y=1s; will be matched, pending the edge/cluster validity
cluster_logic (module) – user-defined constraint functions
cluster_info (pd.DataFrame) – all-names file, with only the columns relevant for clustering
uid_cols (list) – all-name columns with compare_type UniqueID
eid_col (str) – all-name column with compare_type ExistingID
- load_cluster_info(all_names_file, uid_cols, eid_col, cluster_logic, **kw)[source]
Read in the all_names information needed for cluster constraint checking. Columns defined in the config as compare type UniqueID or ExistingID will automatically be loaded (as strings, with missing values represented as NA). Other columns you wish to load should be defined in the user-defined get_columns_used() function (an example constraint module is sketched after this entry).
- Parameters:
all_names_file (str) – path to the all-names file
uid_cols (list) – all-name columns with compare_type UniqueID
eid_col (str) – all-name column with compare_type ExistingID
cluster_logic (module) – user-defined constraint functions
- Returns:
all-names file, with only the columns relevant for clustering
record_id
unique record identifier
<uid column(s)>
columns with compare_type UniqueID
<eid column(s)>
columns with compare_type ExistingID
<user-constraint column(s)>
(optional) columns mentioned in get_columns_used()
- Return type:
pd.DataFrame
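The constraints argument accepted by Cluster can point to a small Python module of user-defined constraint functions. The sketch below is illustrative only: get_columns_used() is the hook named in the documentation above, but the column-to-dtype return format shown and the is_valid_cluster() hook (name and signature) are assumptions that should be checked against the package's constraint documentation.

```python
# my_constraints.py -- illustrative sketch only. get_columns_used() is the
# hook named in the documentation above; the column-to-dtype return format
# and the is_valid_cluster() hook below are assumptions, not a guaranteed
# namematch API.

def get_columns_used():
    # Extra all-names columns (beyond the UniqueID/ExistingID columns that
    # are loaded automatically) needed for constraint checking.
    return {"dob": str, "gender": str}


def is_valid_cluster(cluster, new_edge=None):
    # Hypothetical cluster-level constraint: reject any proposed cluster
    # that spans more than two distinct dates of birth.
    return cluster["dob"].dropna().nunique() <= 2
```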
- cluster_potential_edges(clusters, cluster_assignments, original_cluster_ids, cluster_info, cluster_logic, uid_cols, eid_col, **kw)[source]
Form clusters by adding potential edges to the cluster graph in order of importance, skipping any edge that would cause a constraint violation (a simplified sketch of this loop follows this entry).
- Parameters:
clusters (dict) – maps a cluster id to a list of record ids – post initialization
cluster_assignments (dict) – maps a record_id to a cluster_id – post initialization
original_cluster_ids (set) – cluster ids that are already in use (only for incremental runs)
cluster_info (pd.DataFrame) –
all-names file, with only the columns relevant for clustering
record_id
unique record identifier
<uid column(s)>
columns with compare_type UniqueID
<eid column(s)>
columns with compare_type ExistingID
<user-constraint column(s)>
(optional) columns mentioned in get_columns_used()
potential_edges (deque) – each element is a dict version of a potential edge’s record
cluster_logic (module) – user-defined constraint functions
uid_cols (list) – all-name columns with compare_type UniqueID
eid_col (str) – all-name column with compare_type ExistingID
- Returns:
maps record_id to cluster_id
- Return type:
dict
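Putting the pieces above together, the clustering pass can be thought of as a greedy loop over the potential-edge queue. The sketch below is a simplification, not the actual implementation: it omits the automated UID/ExistingID checks, logging, and incremental-run bookkeeping, and is_valid_cluster here stands in for whatever combination of automated and user-defined checks applies.

```python
# Simplified sketch of the constrained greedy clustering pass: take edges
# in priority order and merge the two endpoint clusters unless a constraint
# check fails.
def sketch_cluster_edges(clusters, cluster_assignments, potential_edges,
                         cluster_info, is_valid_cluster):
    # Assumes every record already belongs to some (possibly singleton)
    # cluster and that cluster_info is indexed by record_id.
    while potential_edges:
        edge = potential_edges.popleft()      # highest-priority edge first
        c1 = cluster_assignments[edge["record_id_1"]]
        c2 = cluster_assignments[edge["record_id_2"]]
        if c1 == c2:
            continue                          # already in the same cluster
        merged_records = clusters[c1] + clusters[c2]
        if not is_valid_cluster(cluster_info.loc[merged_records]):
            continue                          # skip edges that cause violations
        clusters[c1] = merged_records         # accept the merge
        for record_id in clusters.pop(c2):
            cluster_assignments[record_id] = c1
    return cluster_assignments
```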
namematch.generate_output
- class namematch.generate_output.GenerateOutput(params, schema, all_names_file, cluster_assignments_file, an_output_file, output_dir, output_file_uuid=None, *args, **kwargs)[source]
Bases:
NamematchBase
- Parameters:
params (Parameters object) – contains parameter values
schema (Schema object) – contains match schema info (files to match, variables to use, etc.)
all_names_file (str) – path to the all-names file
cluster_assignments_file (str) – path to the cluster-assignments file
an_output_file (str) – path to the all-names-with-clusterid file
output_dir (str) – path to final output directory
- property output_files
- main(**kw)[source]
Read in the cluster assignments dictionary and use it to create the all-names-with-cluster-id file and the “with-cluster-id” versions of the input datasets.
- create_allnames_clusterid_file(all_names_file, cluster_assignments, cleaned_col_names, **kw)[source]
Create all-names-with-clusterid dataframe.
- Parameters:
all_names_file (str) – path to the all-names file
cluster_assignments (dict) – maps record_id to cluster_id
cleaned_col_names (list) – all-name columns used in cosine blocking
- Returns:
all-names-with-cluster-id
record_id
unique record identifier
file_type
either “new” or “existing”
<fields for matching>
both for the matching model and for constraint checking
blockstring
concatenated version of blocking columns (sep by ::)
drop_from_nm
flag, 1 if the record met any “to drop” criteria, 0 otherwise
cluster_id
unique person identifier, no missing values
- Return type:
pd.DataFrame
- output_clusterid_files(data_files, cluster_assignments, output_dir, output_file_uuid=None, **kw)[source]
For each input file, construct a matching output file that has the cluster_id column, and write it.
- Parameters:
data_files (list of DataFile objects) – contains info about each input file
cluster_assignments (dict) – maps record_id to cluster_id
output_dir (str) – the path that was supplied when the name match object was created
namematch.utils.utils
- namematch.utils.utils.setup_logging(log_params, log_filepath, output_temp_dir, filter_stats=False, logging_level='INFO')[source]
Setup logging configuration.
- Parameters:
log_params (dict) – contains info for logging setup
log_filepath (str) – path to store logs
- namematch.utils.utils.log_stat(human_desc, yaml_desc, value)[source]
Log a statistic in the log and in the stats yaml.
- Parameters:
human_desc (str) – human readable description of the stat (could be a phrase)
yaml_desc (str) – concise yaml-key compatible description of the stat
value (float or str) – value of the stat
- namematch.utils.utils.log_runtime_and_memory(method)[source]
Decorator that logs time to execute functions and records max memory usage in GB.
- Parameters:
method (function) – function to measure/log runtime and memory usage
- Returns:
value returned by the function being decorated
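Because it is an ordinary decorator, it can be applied to any long-running step. A minimal usage sketch follows; the decorated function is hypothetical, and within the package itself the decorator is typically applied to task methods.

```python
from namematch.utils.utils import log_runtime_and_memory


@log_runtime_and_memory
def expensive_step(df):
    # Hypothetical long-running step whose runtime and peak memory we want logged.
    return df.groupby("record_id").size()
```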
- namematch.utils.utils.load_yaml(yaml_file)[source]
Load a yaml file into a dictionary.
- Parameters:
yaml_file (str) – path to yaml file
- Returns:
dictionary version of input yaml file
- Return type:
dict
- namematch.utils.utils.dump_yaml(dict_to_write, yaml_file)[source]
Write a dictionary into a yaml file.
- Parameters:
dict_to_write (dict) – dict to write to yaml
yaml_file (str) – path to output yaml file
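These two helpers are simple inverses; a short usage sketch (file names are illustrative):

```python
from namematch.utils.utils import load_yaml, dump_yaml

# Round-trip a config: read it into a dict, tweak a value, write it back out.
params = load_yaml("config.yaml")
params["verbose"] = True
dump_yaml(params, "config_updated.yaml")
```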
- namematch.utils.utils.to_dict(obj)[source]
Convert an object (i.e. instance of a user-defined class) into a dictionary to make writing easier.
- Parameters:
obj (object) – class instance to convert to dict
- namematch.utils.utils.clean_nn_string(n)[source]
Remove suffixes such as JR, SR, and II, along with extra spaces, from near-neighbor (nn) name strings. The original string in the dataframe keeps its punctuation and suffixes.
- Parameters:
n (str) – raw name value
- Returns:
clean version of the input name
- Return type:
str
- namematch.utils.utils.build_blockstring(df, blocking_scheme, incl_ed_string=True)[source]
Create blockstrings (values for blocking separated by ::, such as JOHN::SMITH::1993-07-23) from all-names data.
- Parameters:
df (pd.DataFrame) –
all-names table
record_id
unique record identifier
file_type
either “new” or “existing”
<fields for matching>
both for the matching model and for constraint checking
drop_from_nm
flag, 1 if the record met any “to drop” criteria, 0 otherwise
blocking_scheme (dict) – contains info about fields to block on
incl_ed_string (bool) – True if the blockstring should end with the edit-distance string (e.g. dob)
- Returns:
blockstrings
- Return type:
pd.Series
- namematch.utils.utils.get_nn_string_from_blockstring(blockstring)[source]
Parse out the near-neighbor string (e.g. first-name and last-name) from a blockstring.
- Parameters:
blockstring (str) – string with info for blocking (e.g. JOHN::SMITH::1993-07-23)
- Returns:
near-neighbor string (e.g. JOHN::SMITH)
- Return type:
str
- namematch.utils.utils.get_ed_string_from_blockstring(blockstring)[source]
Parse out the edit-distance string (e.g. dob) from a blockstring.
- Parameters:
blockstring (str) – string with info for blocking (e.g. JOHN::SMITH::1993-07-23)
- Returns:
edit-distance string (e.g. 1993-07-23)
- Return type:
str
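Using the example blockstring from the docstrings above, the two parsers pull out the near-neighbor and edit-distance portions:

```python
from namematch.utils.utils import (
    get_nn_string_from_blockstring,
    get_ed_string_from_blockstring,
)

bs = "JOHN::SMITH::1993-07-23"
get_nn_string_from_blockstring(bs)  # expected: "JOHN::SMITH"
get_ed_string_from_blockstring(bs)  # expected: "1993-07-23"
```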
- namematch.utils.utils.get_endpoints(n, num_chunks)[source]
Divide a number into some number of chunks/intervals.
- Parameters:
n (int) – number to divide into chunks/intervals
num_chunks (int) – number of chunks/intervals to create
- Returns:
list of start and end points to cover entire range
- Return type:
list of int tuples
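The exact endpoint convention (inclusive vs. exclusive ends, handling of remainders) is not spelled out here, but conceptually the helper behaves like the hedged re-implementation below.

```python
# Illustrative re-implementation only; the library's convention for the
# interval boundaries may differ slightly.
def example_endpoints(n, num_chunks):
    size, remainder = divmod(n, num_chunks)
    endpoints, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < remainder else 0)
        endpoints.append((start, end))
        start = end
    return endpoints

example_endpoints(10, 3)  # [(0, 4), (4, 7), (7, 10)]
```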
- namematch.utils.utils.load_sample(csv_path, pct, cols=None)[source]
Load a random sample of a csv into pandas.
- Parameters:
csv_path (str) – path to csv file
pct (float) – what percent of the file to randomly read
cols (list) – columns to load
- Returns:
random subset of the input csv
- Return type:
pd.DataFrame
- namematch.utils.utils.load_csv_list(df_file_list, cols=None, conditions_dict={}, sample=1)[source]
Read a list of .csv files into a single pd.DataFrame.
- Parameters:
df_file_list (list of str) – list of .csv files to read
cols (list) – columns to keep in the dataframe
conditions_dict (dict) – conditions for row filtering
sample (float) – share of rows to randomly sample from the final dataframe
- Returns:
filtered sampled dataframe read in from the .csv files
- Return type:
pd.DataFrame
- namematch.utils.utils.load_parquet(df_file, cols=None, conditions_dict={})[source]
Read a .parquet file into a pd.DataFrame.
- Parameters:
df_file (str) – .parquet file to read
cols (list) – columns to keep in the dataframe
conditions_dict (dict) – conditions for row filtering
- Returns:
filtered dataframe read in from the .parquet file
- Return type:
pd.DataFrame
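A hedged usage sketch: the file path is illustrative, and conditions_dict is assumed to map a column name to the value rows must equal in order to be kept.

```python
from namematch.utils.utils import load_parquet

# Load only the columns needed, keeping rows where drop_from_nm == 0
# (assumed equality-filter semantics for conditions_dict).
an_df = load_parquet(
    "output/details/all_names.parquet",
    cols=["record_id", "drop_from_nm"],
    conditions_dict={"drop_from_nm": 0},
)
```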
- namematch.utils.utils.load_parquet_list(df_file_list, cols=None, conditions_dict={}, sample=1)[source]
Read a list of .parquet files into a single pd.DataFrame.
- Parameters:
df_file_list (list of str) – list of .parquet files to read
cols (list) – columns to keep in the dataframe
conditions_dict (dict) – conditions for row filtering
sample (float) – share of rows to randomly sample from the final dataframe
- Returns:
filtered sampled dataframe read in from the .parquet files
- Return type:
pd.DataFrame
- namematch.utils.utils.determine_model_to_use(dr_df, model_info, verbose=False)[source]
Assign a model to each data row based on which fields are available.
- Parameters:
dr_df (pd.DataFrame) –
data rows
record_id_1
unique identifier for the first record in the pair
record_id_2
unique identifier for the second record in the pair
<distance metric fields>
distance metrics between the two records’ matching fields
label
flag, “1” if the records are a match, “0” if not, “” if unknown
model_info (dict) – information about models and their universes
verbose (bool) – flag controlling logging statement (set according to which function calls this one)
- Returns:
string indicating which model to use for a given record pair
- Return type:
pd.Series
- namematch.utils.utils.load_models(model_info_file, selection=False)[source]
Load pre-trained models (selection and match, as available)
- Parameters:
model_info_file (str) – path to original model config
selection (bool) – if True, try to load a corresponding selection model
- Returns:
dict – maps model name (e.g. basic or no-dob) to a fit model object
dict – information about how to fit the model
- Return type:
tuple of (dict, dict)
namematch.generate_report
- class namematch.generate_report.IgnoreBlackWarning(name='')[source]
Bases:
Filter
Initialize a filter with the name of the logger which, together with its children, will have its events allowed through the filter. If no name is specified, allow every event.
- class namematch.generate_report.GenerateReport(params, schema, report_file, *args, **kwargs)[source]
Bases:
NamematchBase
- Parameters:
params (Parameters object) – contains parameter values
schema (Schema object) – contains match schema info (files to match, variables to use, etc.)
report_file (str) – full path of the report HTML file
- property output_files