haychecker.dhc package

Submodules

haychecker.dhc.metrics module

Module containing metrics for the distributed version of hay_checker.

haychecker.dhc.metrics.completeness(columns=None, df=None)[source]

If a df is passed, the completeness metric is run and the result is returned as a list of scores; otherwise an instance of the Task class containing this metric is returned, to be run later (possibly after adding other tasks/metrics to it).

Parameters:
  • columns (list) – Columns on which to run the metric, None to run the completeness metric on the whole table.
  • df (DataFrame) – Dataframe on which to run the metric, None to have this function return a Task instance containing this metric to be run later.
Returns:

Either a list of scores or a Task instance containing this metric (with these parameters) to be run later.

Return type:

list/Task
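To make the "whole table versus selected columns" distinction concrete, here is a minimal pure-Python sketch of a completeness score. It is not haychecker code: completeness_sketch is a hypothetical helper, rows are plain dicts rather than a Spark DataFrame, and both the [0, 1] scale and the pooling of all cells in whole-table mode are assumptions.

```python
def completeness_sketch(rows, columns=None):
    """Share of non-missing values, as fractions in [0, 1] (assumed scale)."""
    if not rows:
        return []
    if columns is None:
        # Whole-table mode: pool every cell of every row into a single score.
        cells = [value for row in rows for value in row.values()]
        return [sum(value is not None for value in cells) / len(cells)]
    # Per-column mode: one score per requested column, in the given order.
    return [
        sum(row.get(col) is not None for row in rows) / len(rows)
        for col in columns
    ]


rows = [
    {"name": "ada", "age": 36},
    {"name": None, "age": 41},
    {"name": "lin", "age": None},
    {"name": "sam", "age": 29},
]
print(completeness_sketch(rows, ["name", "age"]))  # [0.75, 0.75]
print(completeness_sketch(rows))                   # [0.75]
```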

haychecker.dhc.metrics.constraint(when, then, conditions=None, df=None)[source]

If a df is passed, the constraint metric is run and the result is returned as a list of scores; otherwise an instance of the Task class containing this metric is returned, to be run later (possibly after adding other tasks/metrics to it).

Parameters:
  • when (list) – A list of columns in the df to use as the precondition of a functional constraint. No column should be in both when and then.
  • then (list) – A list of columns in the df to use as the postcondition of a functional constraint. No column should be in both when and then.
  • conditions (list) – Conditions on which to filter data before applying the metric.
  • df (DataFrame) – Dataframe on which to run the metric, None to have this function return a Task instance containing this metric to be run later.
Returns:

Either a list of scores or a Task instance containing this metric (with these parameters) to be run later.

Return type:

list/Task
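The functional constraint described by when and then can be pictured with the pure-Python sketch below. It is an illustration under assumptions, not the library's implementation: constraint_sketch is a hypothetical name, rows are plain dicts, and scoring as "share of rows whose when values map to a single then value" is one plausible reading of the metric.

```python
from collections import defaultdict


def constraint_sketch(rows, when, then):
    """Share of rows that respect the dependency when -> then (assumed scoring)."""
    if set(when) & set(then):
        raise ValueError("no column may appear in both when and then")
    if not rows:
        return 0.0
    # Collect, for each combination of `when` values, the `then` values seen.
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in when)
        seen[key].add(tuple(row[c] for c in then))
    # A row is consistent if its `when` key maps to exactly one `then` value.
    consistent = sum(
        len(seen[tuple(row[c] for c in when)]) == 1 for row in rows
    )
    return consistent / len(rows)


rows = [
    {"zip": "00100", "city": "Rome"},
    {"zip": "00100", "city": "Rome"},
    {"zip": "20100", "city": "Milan"},
    {"zip": "20100", "city": "Milano"},
]
print(constraint_sketch(rows, ["zip"], ["city"]))  # 0.5
```

Here the zip code 20100 maps to two different city spellings, so the two rows carrying it count as violations.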

haychecker.dhc.metrics.deduplication(columns=None, df=None)[source]

If a df is passed, the deduplication metric is run and the result is returned as a list of scores; otherwise an instance of the Task class containing this metric is returned, to be run later (possibly after adding other tasks/metrics to it).

Parameters:
  • columns (list) – Columns on which to run the metric, None to run the deduplication metric on the whole table (deduplication on rows).
  • df (DataFrame) – Dataframe on which to run the metric, None to have this function return a Task instance containing this metric to be run later.
Returns:

Either a list of scores or a Task instance containing this metric (with these parameters) to be run later.

Return type:

list/Task
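The difference between column-level and row-level deduplication can be sketched in plain Python as follows. This is a hedged illustration only: deduplication_sketch is a hypothetical helper, rows are dicts with hashable values, and the score (distinct values divided by total rows) is an assumption about how the metric is expressed.

```python
def deduplication_sketch(rows, columns=None):
    """Ratio of distinct values to total rows (assumed definition)."""
    if not rows:
        return []
    if columns is None:
        # Row-level mode: two rows are duplicates if all their fields match.
        distinct_rows = {tuple(sorted(row.items())) for row in rows}
        return [len(distinct_rows) / len(rows)]
    # Column-level mode: one score per column, in the given order.
    return [len({row[col] for row in rows}) / len(rows) for col in columns]


rows = [
    {"city": "Rome", "zip": "00100"},
    {"city": "Rome", "zip": "00100"},
    {"city": "Milan", "zip": "20100"},
    {"city": "Rome", "zip": "00118"},
]
print(deduplication_sketch(rows, ["city", "zip"]))  # [0.5, 0.75]
print(deduplication_sketch(rows))                   # [0.75]
```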

haychecker.dhc.metrics.deduplication_approximated(columns, df=None)[source]

If a df is passed, the deduplication_approximated metric is run and the result is returned as a list of scores; otherwise an instance of the Task class containing this metric is returned, to be run later (possibly after adding other tasks/metrics to it). Unlike deduplication, columns must be specified here (deduplication_approximated does not work at the whole-row level).

Parameters:
  • columns (list) – Columns on which to run the metric.
  • df (DataFrame) – Dataframe on which to run the metric, None to have this function return a Task instance containing this metric to be run later.
Returns:

Either a list of scores or a Task instance containing this metric (with these parameters) to be run later.

Return type:

list/Task

haychecker.dhc.metrics.entropy(column, df=None)[source]

If a df is passed, the entropy metric is run and the result is returned as a list of scores; otherwise an instance of the Task class containing this metric is returned, to be run later (possibly after adding other tasks/metrics to it).

Parameters:
  • column (str/int) – Column on which to run the metric.
  • df (DataFrame) – Dataframe on which to run the metric, None to have this function return a Task instance containing this metric to be run later.
Returns:

Either a list of scores or a Task instance containing this metric (with these parameters) to be run later.

Return type:

list/Task
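As a reminder of what an entropy score over a column measures, the sketch below computes Shannon entropy on a plain list of values. It is not the library's implementation: entropy_sketch is a hypothetical helper, and the base-2 logarithm (bits) is an assumption, since the logarithm base is not stated here.

```python
import math
from collections import Counter


def entropy_sketch(values):
    """Shannon entropy of the value distribution, in bits (base 2 assumed)."""
    if not values:
        return 0.0
    total = len(values)
    counts = Counter(values)
    # H = -sum over distinct values of p * log2(p), with p the relative frequency.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(entropy_sketch(["a", "a", "b", "b"]))  # 1.0
print(entropy_sketch(["a", "b", "c", "d"]))  # 2.0
```

A column with two equally frequent values carries one bit of entropy; four equally frequent values carry two.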

haychecker.dhc.metrics.freshness(columns, df=None, dateFormat=None, timeFormat=None)[source]

If a df is passed, the freshness metric is run and the result is returned as a list of scores; otherwise an instance of the Task class containing this metric is returned, to be run later (possibly after adding other tasks/metrics to it). Use https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html directives to express formats.

Parameters:
  • columns (list) – Columns on which to run the metric; columns of type string will be cast to timestamp using the dateFormat or timeFormat argument.
  • dateFormat (str) – Format of the values in columns when those columns are of type string; otherwise the columns must be of type date or timestamp. Use this parameter if you are interested in results in terms of days. Either dateFormat or timeFormat must be passed, but not both. Use https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html directives to express formats.
  • timeFormat (str) – Format of the values in columns when those columns are of type string; otherwise the columns must be of type timestamp. Use this parameter if you are interested in results in terms of seconds. Either dateFormat or timeFormat must be passed, but not both. Use https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html directives to express formats.
  • df (DataFrame) – Dataframe on which to run the metric, None to have this function return a Task instance containing this metric to be run later.
Returns:

Either a list of scores or a Task instance containing this metric (with these parameters) to be run later.

Return type:

list/Task
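A result "in terms of days" can be pictured with the following pure-Python sketch, which averages the age of date strings relative to a reference instant. Several things here are assumptions: freshness_days_sketch is a hypothetical helper, averaging the ages is only one plausible reading of the metric, and the explicit now argument exists so the example is reproducible. Note that Python's strptime uses its own directives, so the SimpleDateFormat pattern yyyy-MM-dd is written %Y-%m-%d below.

```python
from datetime import datetime


def freshness_days_sketch(values, fmt, now):
    """Average age of the parsed values, in whole days (assumed definition)."""
    ages = [(now - datetime.strptime(value, fmt)).days for value in values]
    return sum(ages) / len(ages)


# Fixed reference instant so the output does not depend on the current date.
now = datetime(2024, 1, 11)
values = ["2024-01-01", "2024-01-09"]
print(freshness_days_sketch(values, "%Y-%m-%d", now))  # 6.0
```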

haychecker.dhc.metrics.grouprule(columns, having, conditions=None, df=None)[source]

If a df is passed, the grouprule metric is run and the result is returned as a list of scores; otherwise an instance of the Task class containing this metric is returned, to be run later (possibly after adding other tasks/metrics to it).

Parameters:
  • columns – Columns on which to run the metric, grouping data.
  • conditions (list) – Conditions on which to run the metric, filtering data before grouping, can be None.
  • having (list) – Conditions to apply to groups.
  • df (DataFrame) – Dataframe on which to run the metric, None to have this function return a Task instance containing this metric to be run later.
Returns:

Either a list of scores or a Task instance containing this metric (with these parameters) to be run later.

Return type:

list/Task

haychecker.dhc.metrics.mutual_info(when, then, df=None)[source]

If a df is passed, the mutual_info metric is run and the result is returned as a list of scores; otherwise an instance of the Task class containing this metric is returned, to be run later (possibly after adding other tasks/metrics to it).

Parameters:
  • when (str/int) – First column on which to compute MI.
  • then (str/int) – Second column on which to compute MI.
  • df (DataFrame) – Dataframe on which to run the metric, None to have this function return a Task instance containing this metric to be run later.
Returns:

Either a list of scores or a Task instance containing this metric (with these parameters) to be run later.

Return type:

list/Task
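For readers who want the quantity behind MI spelled out, the sketch below computes mutual information between two columns given as a list of (when, then) value pairs. It is a plain-Python illustration, not haychecker code: mutual_info_sketch is a hypothetical helper and the base-2 logarithm is an assumption.

```python
import math
from collections import Counter


def mutual_info_sketch(pairs):
    """Mutual information of two columns, in bits (base 2 assumed)."""
    if not pairs:
        return 0.0
    total = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    # I = sum of p(x, y) * log2(p(x, y) / (p(x) * p(y))) over observed pairs;
    # the ratio simplifies to n * total / (count_x * count_y).
    return sum(
        (n / total) * math.log2(n * total / (left[x] * right[y]))
        for (x, y), n in joint.items()
    )


dependent = [("a", 1), ("a", 1), ("b", 2), ("b", 2)]
independent = [("a", 1), ("a", 2), ("b", 1), ("b", 2)]
print(mutual_info_sketch(dependent))    # 1.0
print(mutual_info_sketch(independent))  # 0.0
```

When one column fully determines the other, the score equals the entropy of either column; when the columns are independent, it is zero.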

haychecker.dhc.metrics.rule(conditions, df=None)[source]

If a df is passed, the rule metric is run and the result is returned as a list of scores; otherwise an instance of the Task class containing this metric is returned, to be run later (possibly after adding other tasks/metrics to it).

Parameters:
  • conditions (list) – Conditions on which to run the metric.
  • df (DataFrame) – Dataframe on which to run the metric, None to have this function return a Task instance containing this metric to be run later.
Returns:

Either a list of scores or a Task instance containing this metric (with these parameters) to be run later.

Return type:

list/Task

haychecker.dhc.metrics.timeliness(columns, value, df=None, dateFormat=None, timeFormat=None)[source]

If a df is passed, the timeliness metric is run and the result is returned as a list of scores; otherwise an instance of the Task class containing this metric is returned, to be run later (possibly after adding other tasks/metrics to it). Use https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html directives to express formats.

Parameters:
  • columns (list) – Columns on which to run the metric; columns of type string will be cast to timestamp using the dateFormat or timeFormat argument.
  • value (str) – Value used to run the metric; values in the specified columns are compared against it.
  • dateFormat (str) – Format of value (and of the values in columns, if they are of type string); also used to cast columns that contain dates as strings. Either dateFormat or timeFormat must be passed, but not both. Use https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html directives to express formats.
  • timeFormat (str) – Format of value (and of the values in columns, if they are of type string); also used to cast columns that contain dates as strings. Either dateFormat or timeFormat must be passed, but not both. Use https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html directives to express formats.
  • df (DataFrame) – Dataframe on which to run the metric, None to have this function return a Task instance containing this metric to be run later.
Returns:

Either a list of scores or a Task instance containing this metric (with these parameters) to be run later.

Return type:

list/Task
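Comparing column values against a reference value can be sketched in plain Python as below. This is an illustration under assumptions rather than the library's behaviour: timeliness_sketch is a hypothetical helper, the score is taken to be the share of values strictly earlier than the reference, and the format string uses Python strptime directives instead of SimpleDateFormat ones.

```python
from datetime import datetime


def timeliness_sketch(values, reference, fmt):
    """Share of values strictly earlier than the reference (assumed direction)."""
    limit = datetime.strptime(reference, fmt)
    # The reference and the column values are parsed with the same format.
    parsed = [datetime.strptime(value, fmt) for value in values]
    return sum(moment < limit for moment in parsed) / len(parsed)


values = ["2023-12-30", "2024-01-05", "2024-02-01", "2023-11-15"]
print(timeliness_sketch(values, "2024-01-01", "%Y-%m-%d"))  # 0.5
```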

haychecker.dhc.task module

Class extending the _Task class from the common scripts. It contains metrics that can be run on different data.

class haychecker.dhc.task.Task(metrics_params=[], allow_casting=True)[source]

Bases: haychecker._common._task._Task

Class that holds defined metrics so they can be run on different data and/or at different times. An instance can contain any number of metrics and, when possible, runs them together in the same pass over the data instead of one at a time. A Task can be extended with other tasks, with a single metric described as a dict, or with a list of such dicts. Once run on a df, it returns a list of results: the contained metrics, in the same order, each with a “scores” field added that holds the results of that metric.

run(df)[source]

For each metric, check its parameters for run-time correctness (e.g. that columns with the given names exist in the df), then perform the required computations. Results are returned as a list of dictionaries identical to the metrics contained in the task, each with an added “scores” field mapped to the list of scores for that metric.

Parameters:df (DataFrame) – DataFrame on which to run the metrics contained in the task.
Returns:List of metrics with their scores added as a field.
Return type:list
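
The deferred pattern shared by all metric functions (return scores when data is passed, otherwise return a task to run later) can be sketched with a small stand-in class. Everything below is hypothetical and deliberately simplified: MiniTask, its add method, the "metric" and "columns" dict keys, and completeness_sketch are illustrative names, rows are plain dicts instead of a DataFrame, and only a completeness-style score is computed.

```python
class MiniTask:
    """Hypothetical stand-in for Task; NOT the haychecker implementation."""

    def __init__(self, metrics_params=None):
        # Copy each dict so the task never mutates the caller's descriptions.
        self._metrics = [dict(m) for m in (metrics_params or [])]

    def add(self, other):
        # Accept another task, a single metric dict, or a list of metric dicts.
        if isinstance(other, MiniTask):
            new = list(other._metrics)
        elif isinstance(other, dict):
            new = [other]
        else:
            new = list(other)
        self._metrics.extend(dict(m) for m in new)
        return self

    def run(self, rows):
        if not rows:
            raise ValueError("no rows to run on")
        results = []
        for metric in self._metrics:
            # Run-time check: every requested column must exist in the data.
            missing = [c for c in metric["columns"] if c not in rows[0]]
            if missing:
                raise ValueError(f"unknown columns: {missing}")
            # Simplification: every metric is scored as share of non-null values.
            scores = [
                sum(r[c] is not None for r in rows) / len(rows)
                for c in metric["columns"]
            ]
            # Same dict as the stored metric, plus a "scores" field.
            results.append({**metric, "scores": scores})
        return results


def completeness_sketch(columns, rows=None):
    task = MiniTask([{"metric": "completeness", "columns": list(columns)}])
    # No data: hand back the task to run later; data: return the scores now.
    return task if rows is None else task.run(rows)[0]["scores"]


rows = [{"name": "ada", "age": 36}, {"name": None, "age": 41}]
task = completeness_sketch(["name"])
task.add(completeness_sketch(["age"]))
results = task.run(rows)
print(results[0]["scores"], results[1]["scores"])  # [0.5] [1.0]
print(completeness_sketch(["name"], rows))         # [0.5]
```

Results come back in the order the metrics were added, which is what lets callers match each score list to the metric that produced it.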

Module contents