disamby package¶
Submodules¶
disamby.cli module¶
Console script for disamby.
With the cli it is possible to carry out the disambiguation on the command line. The script writes a json file whose keys are names taken from the first field and whose values are lists of indices belonging to the same disambiguated cluster.
{'International Business Machines Inc': [1, 2, 5, 34, 90],
'Samsung': [4, 123],
'...': [...],
...}
To carry out the disambiguation you will need a csv file with proper headings and, optionally, an index column. If no index column is specified, the position in the csv is used in the json file to identify the cluster members.
Only a basic processing pipeline is possible here; see the script's --help for the available options.
Example usage:
$ disamby --col-headers name,address \
--index inv_id \
--threshold .7 \
--prep APNX \
input.csv output.json
disamby.core module¶
Main module.
class disamby.core.Disamby(data: typing.Union[pandas.core.frame.DataFrame, pandas.core.series.Series] = None, preprocessors: list = None, field: str = None)[source]¶
Bases: object
Class for disambiguation: fitting, scoring and ranking of potential matches.
A Disamby instance stores the pre-processing pipeline applied to the strings for a given field, as well as the computed frequencies from the entire corpus of strings to match against. Disamby can be instantiated with no arguments, or with a list of strings, a pandas.Series or a pandas.DataFrame. Passing data triggers an immediate call to the fit method, whose docstring explains the parameters.
Examples
>>> import pandas as pd
>>> import disamby.preprocessors as pre
>>> df = pd.DataFrame(
...     {'a': ['Luca Georger', 'Luca Geroger', 'Adrian Sulzer'],
...      'b': ['Mira, 34, Augsburg', 'Miri, 34, Augsburg', 'Milano, 34']
...     }, index=['L1', 'L2', 'O1']
... )
>>> pipeline = [
...     pre.normalize_whitespace,
...     pre.remove_punctuation,
...     pre.trigram
... ]
>>> dis = Disamby(df, pipeline)
>>> dis.disambiguated_sets(threshold=0.5, verbose=False)
[{'L2', 'L1'}, {'O1'}]
alias_graph(threshold=0.7, verbose=True, weights=None, **kwargs) → networkx.classes.digraph.DiGraph[source]¶
Creates the directed network connecting one instance to another through a directed edge if the target instance has a similarity score above the threshold.
Parameters: - threshold (float) – between 0 and 1
- verbose (bool) – whether to show the progress bar
- weights –
- kwargs – arguments to pass to the score function (i.e. offset, smoother)
Returns: Return type: DiGraph
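The construction idea can be sketched in plain Python (a dict of adjacency sets stands in for the networkx DiGraph, and the pairwise scores are made-up illustrative values, not disamby's internals):

```python
# Illustrative sketch: add a directed edge src -> dst whenever the
# similarity score of the pair clears the threshold.
scores = {
    ('L1', 'L2'): 0.9, ('L2', 'L1'): 0.85,
    ('L1', 'O1'): 0.1, ('O1', 'L1'): 0.05,
}

def alias_graph(scores, threshold=0.7):
    graph = {}
    for (src, dst), s in scores.items():
        if s > threshold:
            graph.setdefault(src, set()).add(dst)
    return graph

alias_graph(scores)  # {'L1': {'L2'}, 'L2': {'L1'}}
```

Because the score is directional, an edge L1 → L2 does not imply the reverse edge; only pairs above the threshold in that direction are connected.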
fields¶
find(idx, threshold=0.0, weights: dict = None, **kwargs) → list[source]¶
Returns the list of scored instances which have a score above the threshold. Note that strings which do not share any token are omitted, since their score is 0 by default.
Parameters: - idx – index of the record to find
- threshold –
- weights (dict) –
fit(data: typing.Union[pandas.core.frame.DataFrame, pandas.core.series.Series], preprocessors: list, field: str = None)[source]¶
Computes the frequencies of the terms by field.
Parameters: - data (pandas.DataFrame, pandas.Series or list of strings) – list of strings or pandas.DataFrame if dataframe is given then the field defaults to the column name
- preprocessors (list) – list of functions to apply in that order note the first function must accept a string, the other functions must be such that a pipeline is possible the result is a tuple of strings.
- field (str) – string identifying which field this data belongs to
Examples
>>> import pandas as pd
>>> from disamby.preprocessors import split_words
>>> df = pd.DataFrame(
...     {'a': ['Luca Georger', 'Luke Geroge', 'Adrian Sulzer'],
...      'b': ['Mira, 34, Augsburg', 'Miri, 32', 'Milano, 34']
...     })
>>> dis = Disamby()
>>> prep = [split_words]
>>> dis.fit(df, prep)
id_potential(term: typing.Union[tuple, str], field: str, smoother: str = None, offset=0) → dict[source]¶
Computes normalized weights for the words based on their observed frequency.
Parameters: - term (str, tuple) – term to look for or a tuple of proper tokens
- field (str) – field the word falls under
- smoother (str (optional)) – one of {None, ‘offset’, ‘log’}
- offset (int) – offset to add to the count; only needed for smoothers ‘log’ and ‘offset’
Returns: Return type: dict
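The underlying idea, inverse-frequency weighting normalized over the term's tokens, can be sketched without smoothing (all names and data here are illustrative assumptions, not disamby's implementation):

```python
from collections import Counter

# Rarer tokens carry more identification potential; the weights for a
# term's tokens are normalized to sum to 1.
corpus_tokens = ['luca', 'georger', 'luca', 'adrian', 'sulzer', 'luca']
counts = Counter(corpus_tokens)
total = sum(counts.values())

def id_potential(term_tokens, counts, total):
    inv = {t: total / counts[t] for t in term_tokens}
    norm = sum(inv.values())
    return {t: v / norm for t, v in inv.items()}

weights = id_potential(('luca', 'georger'), counts, total)
# 'georger' (appears once) outweighs the common token 'luca'
```

With the toy corpus above, 'luca' occurs three times and 'georger' once, so 'georger' receives weight 0.75 against 0.25 for 'luca'.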
static pre_process(base_name, functions: list)[source]¶
Applies every function consecutively to base_name.
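"Consecutively" here means a simple left-to-right function composition; a minimal sketch of that behaviour (the pipeline steps are illustrative, not disamby's built-in preprocessors):

```python
# Each function receives the output of the previous one, so the first
# must accept a string and the last typically yields a tuple of tokens.
def pre_process(base_name, functions):
    result = base_name
    for fn in functions:
        result = fn(result)
    return result

steps = [str.strip, str.upper, lambda s: tuple(s.split())]
pre_process('  luca georger ', steps)  # ('LUCA', 'GEORGER')
```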
score(term: str, other_term: str, field: str, smoother=None, offset=0) → float[source]¶
Computes the score between the two strings using the frequency data.
Parameters: - term (str) – term to search for
- other_term (str) – the other term to compare to
- field (str) – the name of the column to which this term belongs
- smoother (str (optional)) – one of {None, ‘offset’, ‘log’}
- offset (int) – offset to add to the count; only needed for smoothers ‘log’ and ‘offset’
Returns: Return type: float
Notes
The score is not commutative, i.e. in general score(A, B) != score(B, A).
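A toy sketch of why such a score need not be commutative: if token weights are normalized over the *query* term's tokens, swapping query and candidate changes the normalization (all names and numbers here are illustrative assumptions, not disamby's code):

```python
from collections import Counter

counts = Counter(['ibm', 'inc', 'international', 'inc', 'machines', 'inc'])
total = sum(counts.values())

def score(term, other, counts, total):
    # Weight each of the query term's tokens by inverse frequency,
    # normalized over the query's own tokens, then sum the weights
    # of tokens that also appear in the candidate.
    inv = {t: total / counts[t] for t in term}
    norm = sum(inv.values())
    weights = {t: v / norm for t, v in inv.items()}
    return sum(w for t, w in weights.items() if t in other)

a = ('ibm', 'inc')
b = ('ibm',)
score(a, b, counts, total)  # 0.75: only 'ibm' overlaps, weighted within a
score(b, a, counts, total)  # 1.0: 'ibm' is b's only token, weight 1.0
```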
disamby.preprocessors module¶
This module contains the various string preprocessors.
disamby.preprocessors.compact_abbreviations(string: str) → str[source]¶
Removes dots between single letters and concatenates them.
Parameters: string –
Returns:
Return type: str
Examples
>>> compact_abbreviations('an other A.B.M this')
'AN OTHER ABM THIS'
disamby.preprocessors.normalize_whitespace(string: str) → str[source]¶
Removes duplicate whitespace and replaces tabs and newlines with a single space.
Parameters: string –
Returns:
Return type: str
Examples
>>> normalize_whitespace('this is a long string')
'THIS IS A LONG STRING'
disamby.preprocessors.ngram(string: str, n: int) → tuple[source]¶
Constructs all possible n-grams from the given string. If the string is shorter than n, the string itself is returned.
Parameters: - string –
- n (int) – must be at least 2
Returns: Return type: tuple of strings
Examples
>>> ngram('this', 2)
('th', 'hi', 'is')
disamby.preprocessors.split_words(string: str) → tuple[source]¶
Splits the string on whitespace. This is more reliable than .split(' ') since it works with any whitespace character (i.e. those recognized by regex).
Parameters: string –
Returns:
Return type: tuple of strings
Examples
>>> len(split_words('a new day'))
3
Module contents¶
Top-level package for disamby.