disamby package

Submodules

disamby.cli module

Console script for disamby.

With the CLI it is possible to carry out the disambiguation on the command line. The script writes a JSON file whose keys are names taken from the first field and whose values are lists of indices belonging to the same disambiguated cluster.

{"International Business Machines Inc": [1, 2, 5, 34, 90],
 "Samsung": [4, 123],
 ...}

To carry out the disambiguation you will need a CSV file with proper headings and optionally an index column. If no index column is specified, the position in the CSV is used in the JSON file to identify cluster members.

Only a basic processing pipeline is possible here; see the script's --help for the available options.

Example usage:

$ disamby --col-headers name,address \
            --index inv_id \
            --threshold .7 \
            --prep APNX \
            input.csv output.json

disamby.core module

Main module.

class disamby.core.Disamby(data: typing.Union[pandas.core.frame.DataFrame, pandas.core.series.Series] = None, preprocessors: list = None, field: str = None)[source]

Bases: object

Class for disambiguation fitting, scoring and ranking of potential matches

A Disamby instance stores the pre-processing pipeline applied to the strings for a given field, as well as the computed frequencies from the entire corpus of strings to match against. Disamby can be instantiated either with no arguments or with a list of strings, a pandas.Series or a pandas.DataFrame; passing data triggers an immediate call to the fit method, whose docstring explains the parameters.

Examples

>>> import pandas as pd
>>> import disamby.preprocessors as pre
>>> df = pd.DataFrame(
... {'a': ['Luca Georger', 'Luca Geroger', 'Adrian Sulzer'],
... 'b': ['Mira, 34, Augsburg', 'Miri, 34, Augsburg', 'Milano, 34']
... }, index=['L1', 'L2', 'O1']
... )
>>> pipeline = [
...     pre.normalize_whitespace,
...     pre.remove_punctuation,
...     pre.trigram
... ]
>>> dis = Disamby(df, pipeline)
>>> dis.disambiguated_sets(threshold=0.5, verbose=False)
[{'L2', 'L1'}, {'O1'}]
alias_graph(threshold=0.7, verbose=True, weights=None, **kwargs) → networkx.classes.digraph.DiGraph[source]

This function creates the directed network connecting an instance to another through a directed edge if the target instance has a similarity score above the threshold.

Parameters:
  • weights
  • threshold (float) – between 0 and 1
  • verbose (bool) – whether to show the progress bar
  • kwargs – arguments to pass to the score function (i.e. offset, smoother)
Returns:

Return type:

DiGraph
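
The construction can be sketched without disamby or networkx: add a directed edge from i to j whenever score(i, j) reaches the threshold. The score table below is invented for illustration; the real scores come from the score method, and the real return value is a networkx DiGraph rather than a plain dict.

```python
# Toy directed threshold graph: edge src -> tgt iff score(src, tgt) >= threshold.
# The pairwise scores here are made up for illustration only.
scores = {
    ('L1', 'L2'): 0.9, ('L2', 'L1'): 0.8,
    ('L1', 'O1'): 0.1, ('O1', 'L1'): 0.2,
    ('L2', 'O1'): 0.0, ('O1', 'L2'): 0.0,
}

def alias_edges(scores: dict, threshold: float) -> dict:
    """Return an adjacency mapping {source: set of targets}."""
    graph = {}
    for (src, tgt), s in scores.items():
        graph.setdefault(src, set())
        if s >= threshold:
            graph[src].add(tgt)
    return graph

graph = alias_edges(scores, threshold=0.7)
# graph == {'L1': {'L2'}, 'L2': {'L1'}, 'O1': set()}
```

Note that because the score is not commutative (see score below), the edge i → j can exist without the reverse edge j → i; this is why the graph is directed.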

disambiguated_sets(threshold=0.7, verbose=True, weights=None, **kwargs)[source]

Returns the disambiguated clusters as a list of sets of indices, as shown in the class example above.
fields
find(idx, threshold=0.0, weights: dict = None, **kwargs) → list[source]

Returns the list of scored instances which have a score above the threshold. Note that strings which do not share any token are omitted, since their score is 0 by default.

Parameters:
  • idx – index of the record to find
  • threshold
  • weights (dict) –
fit(data: typing.Union[pandas.core.frame.DataFrame, pandas.core.series.Series], preprocessors: list, field: str = None)[source]

Computes the frequencies of the terms by field.

Parameters:
  • data (pandas.DataFrame, pandas.Series or list of strings) – the data to fit on; if a DataFrame is given, the field defaults to the column name
  • preprocessors (list) – list of functions to apply, in that order. Note that the first function must accept a string, and each subsequent function must accept the output of the previous one; the final result is a tuple of strings.
  • field (str) – string identifying which field this data belongs to

Examples

>>> import pandas as pd
>>> from disamby.preprocessors import split_words
>>> df = pd.DataFrame(
... {'a': ['Luca Georger', 'Luke Geroge', 'Adrian Sulzer'],
... 'b': ['Mira, 34, Augsburg', 'Miri, 32', 'Milano, 34']
... })
>>> dis = Disamby()
>>> prep = [split_words]
>>> dis.fit(df, prep)
id_potential(term: typing.Union[tuple, str], field: str, smoother: str = None, offset=0) → dict[source]

Computes the weights of the words based on their observed frequencies, normalized.

Parameters:
  • term (str, tuple) – term to look for or a tuple of proper tokens
  • field (str) – field the word falls under
  • smoother (str, optional) – one of {None, 'offset', 'log'}
  • offset (int) – offset to add to the count; only needed for the 'log' and 'offset' smoothers
Returns:

Return type:

dict
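
The idea can be sketched as inverse-frequency weighting (not necessarily disamby's exact formula): rare tokens carry most of the identifying power, common tokens very little. The corpus counts below are invented for illustration.

```python
from collections import Counter

# Invented corpus counts; disamby derives them from the fitted data,
# and its exact weighting formula may differ from this sketch.
counts = Counter({'mueller': 1, 'gmbh': 50, 'ag': 30})

def id_potential(term: tuple, counts: Counter) -> dict:
    """Inverse-frequency weight per token, normalized to sum to 1.

    (The real method additionally supports 'offset' and 'log' smoothing.)
    """
    raw = {tok: 1 / counts[tok] for tok in term}
    total = sum(raw.values())
    return {tok: w / total for tok, w in raw.items()}

weights = id_potential(('mueller', 'gmbh'), counts)
# the rare token 'mueller' receives almost all of the identifying weight
```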

static pre_process(base_name, functions: list)[source]

Applies every function consecutively to base_name.
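
Chaining the functions amounts to a left fold; a minimal sketch of such a pipeline (the pipeline functions here are stdlib stand-ins, not disamby preprocessors):

```python
from functools import reduce

def pre_process(base_name, functions: list):
    """Feed base_name through each function, each receiving the previous output."""
    return reduce(lambda value, fn: fn(value), functions, base_name)

result = pre_process('  Hello,  World ', [str.strip, str.lower])
# 'hello,  world'
```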

score(term: str, other_term: str, field: str, smoother=None, offset=0) → float[source]

Computes the score between the two strings using the frequency data

Parameters:
  • term (str) – term to search for
  • other_term (str) – the other term to compare to
  • field (str) – the name of the column to which this term belongs
  • smoother (str, optional) – one of {None, 'offset', 'log'}
  • offset (int) – offset to add to the count; only needed for the 'log' and 'offset' smoothers
Returns:

Return type:

float

Notes

The score is not commutative (i.e. score(A,B) != score(B,A))
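
Why the score is not commutative can be illustrated with a toy frequency-weighted overlap (not disamby's exact formula): the shared tokens are measured against *each* term's own total weight, so the result depends on which side is the reference. The token weights below are invented.

```python
# Hypothetical token weights (rarer tokens weigh more); in disamby these
# would come from id_potential rather than being hard-coded.
weights = {'international': 0.2, 'business': 0.2, 'machines': 0.5, 'inc': 0.1}

def toy_score(term: tuple, other: tuple, weights: dict) -> float:
    """Share of term's total weight covered by tokens it shares with other."""
    shared = set(term) & set(other)
    total = sum(weights.get(t, 0) for t in term)
    return sum(weights.get(t, 0) for t in shared) / total

a = ('international', 'business', 'machines', 'inc')
b = ('international', 'business', 'machines')
toy_score(a, b, weights)  # 0.9: b covers most of a's weight
toy_score(b, a, weights)  # 1.0: a covers all of b's weight
```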

class disamby.core.ScoredElement(index, score)

Bases: tuple

index

Alias for field number 0

score

Alias for field number 1
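
Since ScoredElement subclasses tuple with two named fields, a minimal equivalent definition is a standard namedtuple:

```python
from collections import namedtuple

# Equivalent sketch: a 2-tuple with named access to index and score.
ScoredElement = namedtuple('ScoredElement', ['index', 'score'])

el = ScoredElement(index='L1', score=0.85)
el.index      # 'L1'
el.score      # 0.85
idx, s = el   # unpacks like a plain tuple
```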

disamby.preprocessors module

This module contains the various string preprocessors

disamby.preprocessors.compact_abbreviations(string: str) → str[source]

Removes dots between single letters and concatenates them

Parameters:string
Returns:
Return type:str

Examples

>>> compact_abbreviations('an other A.B.M this')
'AN OTHER ABM THIS'
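
A rough re-implementation sketch of the documented behaviour (the actual disamby code may differ): collapse runs of dotted single letters, then upper-case as the doctest above shows.

```python
import re

def compact_abbreviations(string: str) -> str:
    # Collapse dotted single letters such as 'A.B.M' into 'ABM' by dropping
    # the dots, then upper-case the whole string (per the doctest above).
    collapsed = re.sub(
        r'\b(?:[A-Za-z]\.)+[A-Za-z]?',
        lambda m: m.group(0).replace('.', ''),
        string,
    )
    return collapsed.upper()

compact_abbreviations('an other A.B.M this')
# 'AN OTHER ABM THIS'
```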
disamby.preprocessors.normalize_whitespace(string: str) → str[source]

Removes duplicate whitespace and replaces tabs and newlines with a single space.

Parameters:string
Returns:
Return type:str

Examples

>>> normalize_whitespace('this is a          long  string')
'THIS IS A LONG STRING'
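
A sketch of the behaviour shown above (the doctest suggests the result is also upper-cased, though the actual implementation may differ):

```python
def normalize_whitespace(string: str) -> str:
    # str.split() with no argument splits on any run of whitespace
    # (spaces, tabs, newlines); joining collapses them to single spaces.
    # Upper-casing follows the doctest above.
    return ' '.join(string.split()).upper()

normalize_whitespace('this is a          long  string')
# 'THIS IS A LONG STRING'
```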
disamby.preprocessors.ngram(string: str, n: int) → tuple[source]

Constructs all possible n-grams from the given string. If the string is shorter than n, the string itself is returned.

Parameters:
  • string
  • n (int) – must be at least 2
Returns:

Return type:

tuple of strings

Examples

>>> ngram('this', 2)
('th', 'hi', 'is')
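
The sliding-window construction, including the short-string fallback, can be sketched as (illustrative, not necessarily disamby's exact code):

```python
def ngram(string: str, n: int) -> tuple:
    # Slide a window of length n over the string; fall back to the whole
    # string when it is shorter than n, per the documented behaviour.
    if len(string) < n:
        return (string,)
    return tuple(string[i:i + n] for i in range(len(string) - n + 1))

ngram('this', 2)  # ('th', 'hi', 'is')
ngram('ab', 3)    # ('ab',)
```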
disamby.preprocessors.trigram(string: str) → tuple[source]
disamby.preprocessors.split_words(string: str) → tuple[source]

Splits words on whitespace. This function is more reliable than .split(' ') since it works with any whitespace character (i.e. those recognized by the regex engine).

Parameters:string
Returns:
Return type:tuple of strings

Examples

>>> len(split_words('a new day'))
3
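
A minimal regex-based sketch of this behaviour (the actual implementation may differ):

```python
import re

def split_words(string: str) -> tuple:
    # \s+ matches any run of whitespace recognised by the regex engine,
    # so tabs and newlines split words just like plain spaces.
    return tuple(re.split(r'\s+', string.strip()))

split_words('a new day')    # ('a', 'new', 'day')
split_words('a\tnew\nday')  # ('a', 'new', 'day')
```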
disamby.preprocessors.remove_punctuation(word: str) → str[source]

Removes all punctuation symbols from the string.

Parameters:word (str) –
Returns:
Return type:str

Examples

>>> remove_punctuation('.has -a .few!')
'has a few'
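
One way to sketch this with the stdlib (not necessarily disamby's implementation) is a translation table that deletes every character in string.punctuation:

```python
import string

def remove_punctuation(word: str) -> str:
    # str.maketrans('', '', chars) builds a table that deletes chars.
    return word.translate(str.maketrans('', '', string.punctuation))

remove_punctuation('.has -a .few!')  # 'has a few'
```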
disamby.preprocessors.nword(word: str, k: int) → tuple[source]

Concatenates k consecutive words into a tuple.

Parameters:
  • word
  • k
Returns:

Return type:

tuple of strings

Examples

>>> nword('this that the other', 2)
('thisthat', 'thatthe', 'theother')
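
The doctest above can be reproduced with a short sliding-window sketch over the word list (illustrative, not necessarily disamby's exact code):

```python
def nword(word: str, k: int) -> tuple:
    # Split on whitespace, then join each run of k consecutive words.
    words = word.split()
    return tuple(''.join(words[i:i + k]) for i in range(len(words) - k + 1))

nword('this that the other', 2)
# ('thisthat', 'thatthe', 'theother')
```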

Module contents

Top-level package for disamby.