mdcgenpy.clusters package

Submodules

mdcgenpy.clusters.distributions module

class mdcgenpy.clusters.distributions.Distribution(f, **kwargs)[source]

Bases: object

mdcgenpy.clusters.distributions.check_input(distributions)[source]

Checks whether the input distributions are valid, that is, whether each one is either a string or a function. Strings must also be contained in distributions_list.

Parameters:distributions (list of list of (str or function)) – Distributions given as input.
Returns:Functions for the distributions given as input.
Return type:(list of list of function)
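
The validation described above can be sketched as follows. This is a minimal, self-contained illustration and not the library's implementation: KNOWN is a hypothetical stand-in for distributions_list built on Python's standard random module, and resolve, resolve_all and constant are hypothetical names.

```python
import random

# Hypothetical stand-in for distributions_list: name -> sampling function.
KNOWN = {
    "uniform": lambda n: [random.uniform(-1.0, 1.0) for _ in range(n)],
    "gaussian": lambda n: [random.gauss(0.0, 1.0) for _ in range(n)],
}


def constant(n):
    """Example of a user-supplied distribution function."""
    return [0.0] * n


def resolve(d):
    """Return a callable for one distribution given as a name or a function."""
    if callable(d):
        return d
    if isinstance(d, str):
        if d not in KNOWN:
            raise ValueError(f"Unknown distribution name: {d!r}")
        return KNOWN[d]
    raise TypeError(f"Expected str or function, got {type(d).__name__}")


def resolve_all(distributions):
    """Validate a list of lists and resolve every entry to a function."""
    return [[resolve(d) for d in inner] for inner in distributions]


funcs = resolve_all([["uniform", "gaussian"], [constant, "uniform"]])
```

Functions pass through unchanged, known names are looked up, and an unknown name such as "cauchy" raises ValueError in this sketch.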
mdcgenpy.clusters.distributions.distributions_list = {'gamma': <function <lambda>>, 'gap': <function <lambda>>, 'gaussian': <function <lambda>>, 'logistic': <function <lambda>>, 'normal': <function <lambda>>, 'triangular': <function <lambda>>, 'uniform': <function <lambda>>}

List of distributions for which you can just provide a string as input.

mdcgenpy.clusters.distributions.gap(shape, param)[source]
mdcgenpy.clusters.distributions.get_dist_function(d)[source]

Transforms a distribution name into the respective function.

Parameters:d (str or function) – Input distribution str/function.
Returns:Actual function to compute the intended distribution.
Return type:function
mdcgenpy.clusters.distributions.valid_distributions = ['gaussian', 'normal', 'triangular', 'gap', 'logistic', 'uniform', 'gamma']

List of valid strings for distributions.

mdcgenpy.clusters.generate module

mdcgenpy.clusters.generate.compute_batch(clus_cfg, n_samples)[source]

Generates one batch of data.

Parameters:
  • clus_cfg (clusters.DataConfig) – Configuration.
  • n_samples (int) – Number of samples in the batch.
Returns:

Generated sample.

Return type:

np.array

mdcgenpy.clusters.generate.generate_clusters(clus_cfg, batch_size=0)[source]

Generate data.

Parameters:
  • clus_cfg (clusters.DataConfig) – Configuration.
  • batch_size (int) – Number of samples for each batch.
Yields:

np.array – Generated samples.

np.array – Labels for the samples.
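
A generator that yields (samples, labels) batches might look roughly like the sketch below. It is an illustration only: iter_batches is a hypothetical name, the toy one-dimensional samples are made up, and treating batch_size == 0 as "one batch with everything" and stopping after n_samples points are both assumptions that this reference does not confirm.

```python
import random


def iter_batches(n_samples, n_clusters, batch_size=0, seed=1):
    """Yield (samples, labels) pairs until n_samples points were produced.

    Assumption: batch_size == 0 means a single batch holding all samples.
    """
    if batch_size < 0:
        raise ValueError("batch_size must be >= 0")
    rng = random.Random(seed)
    size = n_samples if batch_size == 0 else batch_size
    produced = 0
    while produced < n_samples:
        n = min(size, n_samples - produced)
        labels = [rng.randrange(n_clusters) for _ in range(n)]
        # Toy 1-D samples: the cluster index plus a small Gaussian jitter.
        samples = [label + rng.gauss(0.0, 0.1) for label in labels]
        produced += n
        yield samples, labels


batches = list(iter_batches(n_samples=10, n_clusters=3, batch_size=4))
single = list(iter_batches(n_samples=10, n_clusters=3))
```

With these arguments the first call produces batches of 4, 4 and 2 points, and the second produces one batch of 10.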

mdcgenpy.clusters.generate.generate_mass(clus_cfg)[source]

Get the number of samples to generate for each cluster.

Parameters:clus_cfg (clusters.DataConfig) – Configuration.
Returns:
Array with len == nr of clusters, where each entry is the number of samples to generate in the corresponding cluster.
Return type:np.array
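
One way to split a total number of samples across clusters while respecting a per-cluster minimum is sketched below. This is not the library's algorithm: split_mass is a hypothetical name, the random one-at-a-time allocation is a swapped-in technique, and truncating the documented default minimum N/(ki_coeff*k) to an integer is an assumption.

```python
import random


def split_mass(n_samples, k, ki_coeff=3.0, seed=1):
    """Return k integer cluster sizes that sum to n_samples."""
    rng = random.Random(seed)
    # Default minimum per cluster, N / (ki_coeff * k), truncated (assumption).
    minimum = int(n_samples / (ki_coeff * k))
    mass = [minimum] * k
    remaining = n_samples - minimum * k
    # Hand out the remaining samples one by one to randomly chosen clusters.
    for _ in range(remaining):
        mass[rng.randrange(k)] += 1
    return mass


mass = split_mass(n_samples=2000, k=5)
```

With the documented defaults (n_samples=2000, k=5, ki_coeff=3.0), every entry is at least 133 and the entries sum to 2000.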
mdcgenpy.clusters.generate.get_rotation_matrix(n_feats)[source]
mdcgenpy.clusters.generate.locate_centroids(clus_cfg)[source]

Generate locations for the centroids of the clusters.

Parameters:clus_cfg (clusters.DataConfig) – Configuration.
Returns:Matrix (n_clusters, n_feats) with positions of centroids.
Return type:np.array

Module contents

class mdcgenpy.clusters.Cluster(cfg, idx, corr_matrix=None)[source]

Bases: object

Contains the parameters of an individual cluster.

__init__(cfg, idx, corr_matrix=None)[source]
Parameters:
  • cfg (ClusterGenerator) – Configuration of the data.
  • idx (int) – Index of a cluster.
  • corr_matrix (np.array) – Valid correlation matrix to use in this cluster.
compactness_factor
corr
distributions
generate_data(samples)[source]
mv
n_feats
n_noise
rotate
scale
settables = ['distributions', 'mv', 'corr', 'compactness_factor', 'scale', 'rotate', 'n_noise']

List of settable properties of Cluster. These are the parameters which can be set at a cluster level, and override the parameters of the cluster generator.
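
The override behaviour can be pictured with the sketch below. GeneratorSettings and ClusterSettings are hypothetical classes written for illustration only; they are not part of mdcgenpy, and only the list of settable names is taken from this page.

```python
class GeneratorSettings:
    """Hypothetical holder for generator-level parameter values."""

    def __init__(self, **params):
        self.params = params


class ClusterSettings:
    """Hypothetical per-cluster view that falls back to the generator."""

    # Same names as the documented Cluster.settables list.
    settables = ['distributions', 'mv', 'corr', 'compactness_factor',
                 'scale', 'rotate', 'n_noise']

    def __init__(self, generator, **overrides):
        unknown = sorted(set(overrides) - set(self.settables))
        if unknown:
            raise ValueError(f"Not settable at cluster level: {unknown}")
        self.generator = generator
        self.overrides = overrides

    def get(self, name):
        # A cluster-level value wins; otherwise use the generator-level one.
        if name in self.overrides:
            return self.overrides[name]
        return self.generator.params[name]


gen = GeneratorSettings(corr=0.0, rotate=True, compactness_factor=0.1)
cluster = ClusterSettings(gen, corr=0.5)
```

Here cluster.get("corr") returns the cluster-level 0.5, while cluster.get("rotate") falls back to the generator-level True.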

class mdcgenpy.clusters.ClusterGenerator(seed=1, n_samples=2000, n_feats=2, k=5, min_samples=0, possible_distributions=None, distributions=None, mv=True, corr=0.0, compactness_factor=0.1, alpha_n=1, scale=True, outliers=50, rotate=True, add_noise=0, n_noise=None, ki_coeff=3.0, **kwargs)[source]

Bases: object

Structure to handle the input and create clusters according to it.

__init__(seed=1, n_samples=2000, n_feats=2, k=5, min_samples=0, possible_distributions=None, distributions=None, mv=True, corr=0.0, compactness_factor=0.1, alpha_n=1, scale=True, outliers=50, rotate=True, add_noise=0, n_noise=None, ki_coeff=3.0, **kwargs)[source]
Parameters:
  • seed (int) – Seed for the generation of random values. Useful for consistency.
  • n_samples (int) – Number of samples to generate.
  • n_feats (int) – Number of dimensions/features for each sample.
  • k (int or list of int) – Number of clusters to generate. If input is a list, each element in it specifies the number of samples in each cluster. In that case, the number of clusters will be the length of the list.
  • min_samples (int) – Minimum number of samples in each cluster. If 0, the default minimum for a cluster with \(N\) samples is \(N/(\text{ki_coeff}*k)\).
  • possible_distributions (list) –

    List of distributions to randomly choose from. Each element in this list must either be a valid str (valid str are defined in valid_distributions) OR a function which implements the distribution OR an instance of Distribution.

    This parameter is overridden by distributions, when set.

  • distributions (str or function or distributions.Distribution or list) –

    Distribution to be used. If list, its length must be k, and each element in the list must either be a valid str (indicating the distribution to be used) OR a function which implements the distribution OR a list of str/functions with length n_feats.

    Instances of Distribution can also be used.

    Valid str are defined in valid_distributions.

  • mv (bool or list of bool or None) –

    Multivariate distributions or distributions defining intra-distances. If True, distributions define feature values (multivariate). If False, distributions define intra-distances.

    If None, this choice is made at random.

    If a list, its length must be k, and each value in the list applies to one cluster.

  • corr (float or list of float) –

    Maximum (in absolute value) correlation between variables.

    If a list, its length must be k, and each value in the list applies to one cluster.

  • compactness_factor (float or list of float) –

    Compactness factor.

    If a list, its length must be k, and each value in the list applies to one cluster.

  • alpha_n (float or list of float) –

    Determines grid hyperplanes. If \(\alpha_n > 0\), the number of hyperplanes is a factor of \(\alpha_n * \left \lfloor{1 + \frac{k}{\log(k)}}\right \rfloor\).

    If \(\alpha_n < 0\), the number of hyperplanes is \(|\alpha_n|\).

    If a list, its length must be n_feats, and each value in the list applies to one dimension.

  • scale (bool or list of bool) –

    Optimizes cluster separation based on grid size. If True, scale based on min distance between grid hyperplanes. If False, scale based on max distance between grid hyperplanes.

    If None, does not scale.

    If a list, its length must be k, and each value in the list applies to one cluster.

  • outliers (int) – Number of outliers.
  • rotate (bool or list of bool) –

    If True, clusters can rotate.

    If a list, its length must be k, and each value in the list applies to one cluster.

  • add_noise (int) – Add this number of noisy dimensions.
  • n_noise (list) –

    Parameter that manages noisy dimensions.

    If a list of int (of size \(\leq\) n_feats, and each element is \(\geq 0\) and \(<\) n_feats), each dimension listed (0-indexed) will have only noise.

    If a list of list of int (of length k, and each element is a list of length \(\leq\) n_feats, with values \(\geq 0\) and \(<\) n_feats), each list indicates the noisy dimensions for a particular cluster.

  • ki_coeff (float) – Coefficient used to define the default minimum number of samples per cluster.
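
Two conventions from the parameter list above can be made concrete with a short sketch. n_hyperplanes evaluates the documented alpha_n expressions, assuming that log is the natural logarithm and that k > 1; per_cluster shows the "scalar or list of length k" convention. Both function names are hypothetical and neither is part of the library.

```python
import math


def n_hyperplanes(alpha_n, k):
    """Evaluate the documented alpha_n expressions for k clusters (k > 1)."""
    if alpha_n > 0:
        # alpha_n * floor(1 + k / log(k)); natural log is an assumption.
        return alpha_n * math.floor(1 + k / math.log(k))
    if alpha_n < 0:
        return abs(alpha_n)
    raise ValueError("alpha_n == 0 is not covered by the documented cases")


def per_cluster(value, k):
    """Expand a scalar to k entries, or check that a list has length k."""
    if isinstance(value, (list, tuple)):
        if len(value) != k:
            raise ValueError(f"Expected {k} values, got {len(value)}")
        return list(value)
    return [value] * k


default_planes = n_hyperplanes(alpha_n=1, k=5)
explicit_planes = n_hyperplanes(alpha_n=-7, k=5)
corr_values = per_cluster(0.0, 5)
```

For the default k=5 and alpha_n=1, the positive branch gives floor(1 + 5/ln 5) = 4, and alpha_n=-7 gives 7 directly.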
generate_data(batch_size=0)[source]
get_cluster_configs()[source]
mass
class mdcgenpy.clusters.ScheduledClusterGenerator(schedule, *args, **kwargs)[source]

Bases: mdcgenpy.clusters.ClusterGenerator

This cluster generator takes a schedule and all the ClusterGenerator arguments, and activates only the clusters specified in the schedule at each time step. A time step is defined as one get call to self.mass, which is done when generating each new batch. That is, one time step is one call to generate.compute_batch().
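
The idea of a time step that advances on every read of mass can be sketched with a property. Everything here is hypothetical: the ScheduledMass class, the schedule format (one list of active cluster indices per step), the wrap-around at the end of the schedule, and setting inactive clusters to zero are assumptions for illustration, since this page does not document the actual schedule format.

```python
class ScheduledMass:
    """Hypothetical sketch: each read of .mass advances one time step."""

    def __init__(self, base_mass, schedule):
        self.base_mass = list(base_mass)
        # Assumed format: schedule[t] lists the cluster indices active at step t.
        self.schedule = schedule
        self.step = 0

    @property
    def mass(self):
        # Wrap around when the schedule is exhausted (assumption).
        active = set(self.schedule[self.step % len(self.schedule)])
        self.step += 1
        # Inactive clusters contribute no samples in this step.
        return [m if i in active else 0
                for i, m in enumerate(self.base_mass)]


sched = ScheduledMass(base_mass=[10, 20, 30], schedule=[[0], [1, 2]])
first = sched.mass
second = sched.mass
```

Because reading the property has a side effect, code following this pattern would read mass exactly once per batch.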

__init__(schedule, *args, **kwargs)[source]
Parameters:
mass