Hierarchical Pipeline Optimization
hpopt stands for Hierarchical Pipeline Optimization, a Python module for the automatic configuration of software pipelines, specifically machine learning pipelines.
A pipeline is simply a sequence or chain of processes (the pipeline steps) that iteratively transforms an input into an output. A machine learning pipeline is often composed of steps such as feature extraction, preprocessing, vectorization, dimensionality reduction, and classification (or regression). These pipelines often have many configurable hyper-parameters, such as the actual classification algorithm to use or the strength of a regularization factor. hpopt gives you tools to define custom pipelines of any degree of complexity and to automatically find the best configuration for a given problem (that is, for a given dataset). This task, finding the optimal configuration of a machine learning pipeline, is often called Auto-ML.
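To make the idea concrete, here is a minimal, self-contained illustration of a pipeline as a chain of steps. This is plain Python, not hpopt's API; the step names (`tokenize`, `lowercase`, `count_vectorize`) are invented for the example.

```python
def tokenize(text):
    # feature extraction: split raw text into tokens
    return text.split()

def lowercase(tokens):
    # preprocessing: normalize token case
    return [t.lower() for t in tokens]

def count_vectorize(tokens):
    # vectorization: map each token to its frequency
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return counts

def pipeline(x, steps):
    # apply each step to the output of the previous one
    for step in steps:
        x = step(x)
    return x

features = pipeline("Spam spam eggs", [tokenize, lowercase, count_vectorize])
print(features)  # {'spam': 2, 'eggs': 1}
```

Each step here is fixed; the point of hpopt is that steps like these become *choices* with their own hyper-parameters, and the framework searches over them.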
The hierarchical part comes from the fact that in hpopt you define the set of all possible pipelines using a hierarchical structure. In formal terms it is a context-free grammar; informally, it is a hierarchical definition that starts with top-level steps (such as preprocessing, vectorization, etc.), which are themselves defined recursively in terms of simpler steps, down to concrete algorithms and their hyper-parameters.
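The following sketch shows what such a hierarchical definition might look like, written as nested alternatives in plain Python. This is only illustrative of the idea, not hpopt's actual syntax: each top-level step expands into a choice of algorithms, and each algorithm into choices for its hyper-parameters.

```python
from itertools import product

# Hypothetical search space: step -> {algorithm -> {hyper-parameter -> options}}
space = {
    "preprocessing": {
        "normalizer": {"norm": ["l1", "l2"]},
        "scaler": {"with_mean": [True, False]},
    },
    "classification": {
        "logistic_regression": {"C": [0.1, 1.0]},
        "decision_tree": {"max_depth": [3, 5]},
    },
}

def expand(space):
    # Enumerate every concrete pipeline the hierarchical definition derives.
    steps = list(space)
    per_step = []
    for step in steps:
        options = []
        for algo, params in space[step].items():
            names = list(params)
            for values in product(*(params[n] for n in names)):
                options.append((algo, dict(zip(names, values))))
        per_step.append(options)
    return [dict(zip(steps, combo)) for combo in product(*per_step)]

configs = expand(space)
print(len(configs))  # 4 preprocessing options x 4 classification options = 16
```

Even this tiny grammar derives 16 distinct pipelines; realistic spaces are far larger, which is why an automatic search is needed.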
If you want to know more about the inner workings of hpopt, we recommend reading our paper. The source code is also available.
Quick start
hpopt can be used in many different ways, with varying degrees of customization. The easiest way to use hpopt is as a black-box Auto-ML tool. hpopt provides several scikit-learn-compatible estimators that you can use right away:
```python
from hpopt.sklearn import SklearnClassifier

X, y = get_dataset()  # <-- custom data logic

clf = SklearnClassifier()
clf.fit(X, y)
```
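Under the hood, fitting such an estimator means searching the configuration space for the pipeline that scores best on your data. As a rough mental model only (hpopt's actual search strategy is described in the paper), here is a plain random search over a toy space; the space, the `evaluate` stand-in, and its scores are all invented for illustration.

```python
import random

# Hypothetical two-choice space, purely for illustration.
search_space = {
    "classifier": ["logistic_regression", "decision_tree"],
    "regularization": [0.01, 0.1, 1.0],
}

def evaluate(config):
    # Stand-in for a real metric such as cross-validated accuracy.
    score = 0.5
    if config["classifier"] == "logistic_regression":
        score += 0.2
    if config["regularization"] == 0.1:
        score += 0.1
    return score

def random_search(space, evaluate, iterations=20, seed=0):
    # Sample random configurations, keep the best one seen so far.
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(iterations):
        config = {k: rng.choice(v) for k, v in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best, score = random_search(search_space, evaluate)
print(best, score)
```

The estimator shown above plays the role of `random_search` plus `evaluate` combined: it samples candidate pipelines, trains and scores them on `X, y`, and keeps the best.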