Hello. We're in the atrium of the NCSA building,
where new features of both hardware and software are explored and evaluated in
order to improve the efficacy of high-performance computing and big data analysis.
Previous work like this at NCSA resulted in early web browsers and web servers,
a supercomputer made from video game consoles,
an immersive CAVE Visualization Environment,
and a specialized GIS computational platform.
Now, NCSA is increasingly exploring data-intensive hardware and
software to aid scientists around
the world as they struggle with ever-increasing amounts of data.
In this module, we look to improve the performance of
machine learning algorithms by employing feature engineering.
Feature engineering encompasses any aspect of
selecting or creating the features used to generate a machine learning model.
Often this task is overlooked in favor of focusing
on trying out the latest classification or regression algorithm.
However, time spent on feature engineering often pays back amazing dividends.
In part, this is because any machine learning algorithm will run faster
and be easier to understand if the most important features are used to make predictions.
Such a model often performs nearly as well as, and
sometimes even better than, a model generated from more data.
One overlooked aspect of feature engineering is
the selection of features based on ethical or moral concerns.
This can be especially important, and often regulated, in public-facing industries.
For example, if you were determining credit ratings or university admissions,
you would not want to base a decision on ethnicity or gender.
While it might be easy to exclude those features from an analysis,
how do you ensure there are no hidden correlations between
these features and other features that remain in the data set,
such as home address or occupation?
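One simple starting point, sketched below, is to measure how strongly a sensitive attribute is associated with the features you plan to keep. This is only a minimal illustration, assuming a pandas DataFrame loaded from a hypothetical file; the column names are placeholders, and correlating integer-encoded categories is a crude proxy for a more rigorous fairness audit.

```python
# A minimal sketch, not a rigorous fairness audit: encode categorical
# columns as integer codes and check how strongly a sensitive attribute
# correlates with features that would remain in the model.
import pandas as pd

df = pd.read_csv("applicants.csv")  # hypothetical data set

# Integer-encode the sensitive attribute we plan to exclude.
sensitive = df["gender"].astype("category").cat.codes

# Check each remaining feature for correlation with the excluded one.
for col in ["home_address", "occupation"]:
    proxy = df[col].astype("category").cat.codes
    r = sensitive.corr(proxy)
    print(f"correlation between gender and {col}: {r:.2f}")
```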
To explore these issues in more detail, the first lesson includes
articles on the ethics of machine learning and artificial intelligence.
As automation becomes more prevalent in our society,
this issue will become more important for
both individuals and companies who wish to behave ethically,
and to mitigate potential liabilities.
The next lesson introduces feature selection, where algorithms determine
the set of features that optimizes the performance of a given machine learning algorithm.
There are a number of such methods, some of which work
in general and others that are tied to specific algorithms.
For example, algorithms that implement regularization can
identify important features as those whose coefficients survive the penalty,
while tree-based algorithms measure feature importance directly.
Other techniques iteratively select different combinations of
features, or drop features one at a time, and repeatedly evaluate the performance of an algorithm.
To simplify this process,
these tasks are often implemented as part of a machine learning pipeline.
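As a rough sketch of these ideas, the example below uses scikit-learn (an assumption about tooling; the lesson may use different libraries) to select features with an L1-regularized model inside a pipeline, and separately reads importances from a tree-based model. The data set and parameter values are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# L1 regularization drives the coefficients of unimportant features to
# zero; SelectFromModel keeps only features with nonzero coefficients.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1))),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
print("features kept:", pipe.named_steps["select"].get_support().sum())

# Tree-based models expose per-feature importances directly.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print("most important feature index:", forest.feature_importances_.argmax())
```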
The third lesson focuses on principal component analysis,
which is a popular dimensionality reduction technique.
This algorithm computes new features that are
linear combinations of the existing features in the original data set.
This process also computes the amount of variance in
the original data set that is explained by each new feature.
As a result, we can keep only as many of the new features as are needed
to meet a predefined data reconstruction quality,
and in the process, generate a smaller set of
features to use in our machine learning algorithm.
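A minimal sketch of this idea, again assuming scikit-learn: passing a fraction to PCA's n_components keeps just enough components to explain that share of the variance, which stands in here for the predefined reconstruction quality.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)

# A float in (0, 1) tells PCA to keep the smallest number of components
# whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(f"{X.shape[1]} original features -> {X_reduced.shape[1]} components")
print("variance explained by first component:", pca.explained_variance_ratio_[0])
```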
The final lesson introduces a similar approach,
but one where we attempt to learn the distribution of
data in a potentially complex geometry
in order to find a much smaller set of descriptive features.
This type of algorithm is known as manifold learning,
which is an unsupervised technique to map
potentially nonlinear structures to many fewer dimensions.
Manifold learning is very popular as a visualization technique, since
a high-dimensional data set can be mapped to two or three dimensions,
thus allowing a complex data set to be easily visualized.
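As a brief illustration, the sketch below uses t-SNE, one widely used manifold learning technique (the specific algorithms covered in the lesson are an assumption), to map the 64-dimensional digits data set down to two dimensions for plotting.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# t-SNE learns a nonlinear two-dimensional embedding that tries to
# preserve the local neighborhood structure of the 64-dimensional data.
embedding = TSNE(n_components=2, random_state=0).fit_transform(X)

plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE embedding of the digits data set")
plt.show()
```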
Hopefully, you are intrigued by the potential of feature engineering.
This technique can not only improve your performance on a machine learning task,
but also aid in making the results understandable to a broader community,
both by reducing the number of dimensions included in the analysis,
as well as by enabling the visualization of
high-dimensional data in a two- or three-dimensional plot. Good luck.