June 2007
Preparing Experimental Data for Mathcad Analysis: Dealing with Categorical Variables
by Frank Smietana

This article is the third in a series that examines specific aspects of real-world data set preparation and demonstrates the functionality Mathcad provides for exposing the information content of a data set. It discusses the challenges presented by categorical variables and includes some handy transformations, such as circular and binary representations, for turning this type of data into useful information.


The first two articles in the series are in the April and May issues of PTC Express, under Knowledge Base Exclusive.


Flavors of Categorical Variables


Categorical variables consist of a finite number of categories and can be either numeric or alphanumeric. This definition is somewhat subjective: it is up to the data analyst to decide what counts as a "finite number", typically no more than 30. Continuous variables, by comparison, can take an infinite number of values and are strictly numeric.

Nominal variables are the lowest type of categorical variable in terms of information content. Nominal variables can be thought of as names or labels and have no numerical equivalent. Since nominal variables have no order or magnitude in a numerical sense, the only relevant mathematical relationships between nominal values are equality and inequality. This caveat should not diminish the importance of nominal variables; you simply need to transform them so that whatever information they do contain can be used effectively by the modeling engine. Gender and zip code are two examples of nominal variables.


Ordinal variables are next in the information content hierarchy. As their name implies, they can be numerically ordered, so additional relationships such as "greater than" and "less than" can be applied to ordinal variables. It is important to note that the values of truly ordinal variables do not imply any additional information, such as magnitude. All you can say about an ordinal categorical variable with the values 1, 2, and 10 is that 10 is greater than 2, not that category 10 is some order of magnitude greater than category 2. Examples of ordinal variables include highest attained educational level and socioeconomic status.


Transforming Categorical Variables


Take a look at some ways to transform categorical variables into representations that optimize their usefulness for statistical modeling.

Assume you are trying to predict pump failure (Failed) using a data set that contains the categorical variables Day and Installer. Most statistical modeling methods cannot deal with alphanumeric data, so the Installer variable needs to be transformed. Although the Day variable is numeric, you can encode it in a manner that makes its information content more accessible to the modeling method employed.



Creating a Mapping Table


Before you can transform nominal variables to more suitable representations, you need to create a mapping table that associates each value of a particular nominal variable with its corresponding numeric transformation. In the process, it helps to quantify the number of occurrences of each value or category. This exercise alerts you to sparse categories and gives you a rough feel for the variable's distribution. The following routine takes a data table and a variable name as input and generates a two-column matrix containing each nominal value and its respective frequency in the data set:
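The Mathcad routine itself is not reproduced in this text. As a rough equivalent, the Python sketch below counts the occurrences of each category in one column. The function name `category_counts`, the dictionary-per-row layout, and the sample rows are all assumptions made for illustration and are not taken from the article's data set.

```python
from collections import Counter

def category_counts(rows, column):
    """Return a two-column table [(value, count), ...] for one variable.

    Assumes all values in the column share one comparable type
    (all numbers or all strings) so they can be sorted.
    """
    counts = Counter(row[column] for row in rows)
    return sorted(counts.items())

# Hypothetical sample rows, not the article's pump data.
rows = [
    {"Day": 1, "Installer": "Acme", "Failed": 0},
    {"Day": 5, "Installer": "Baker", "Failed": 1},
    {"Day": 1, "Installer": "Acme", "Failed": 0},
    {"Day": 3, "Installer": "Crane", "Failed": 1},
]

installer_table = category_counts(rows, "Installer")
day_table = category_counts(rows, "Day")
```

A category with a very small count in such a table is a candidate for merging with another category before encoding.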



Applying this function to the two categorical variables found in the data set creates the following summary tables:






Dealing with Circular Discontinuities


Even though the Day variable is numeric and ordinal, another critical transformation is required. Many time-based variables such as day and month are intrinsically circular. For example, December is as close to January as it is to November. It is important to present the variable to the modeling method in a way that preserves this key information. Assigning 1 to Monday and 5 to Friday makes it appear that Monday and Friday are far apart, when in reality they are adjacent in the domain of business days.



To expose a variable's inherent circularity, the circle transform is used. This simple routine maps numerated categories onto equidistant points on a circle. The tradeoff is that the original variable is now represented by two proxy variables, the sine and cosine of the category's position on the circle, instead of a single variable.
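One way to sketch the circle transform in Python is shown below. The function name `circle_transform` and the choice to emit the sine first and the cosine second are assumptions; the article's Mathcad routine may order or scale the two proxy variables differently.

```python
import math

def circle_transform(categories):
    """Map each category to (sin, cos) of its position on a circle.

    The k-th of n categories is placed at angle 2*pi*k/n, so the
    points are equally spaced and the last one sits next to the first.
    """
    n = len(categories)
    table = {}
    for k, cat in enumerate(categories):
        angle = 2.0 * math.pi * k / n
        table[cat] = (math.sin(angle), math.cos(angle))
    return table

def proxy_distance(table, a, b):
    """Straight-line distance between two categories in proxy space."""
    (s1, c1), (s2, c2) = table[a], table[b]
    return math.hypot(s1 - s2, c1 - c2)

days = [1, 2, 3, 4, 5]          # Monday .. Friday
day_map = circle_transform(days)

mon_fri = proxy_distance(day_map, 1, 5)
mon_tue = proxy_distance(day_map, 1, 2)
mon_wed = proxy_distance(day_map, 1, 3)
```

In the transformed space Monday is exactly as far from Friday as it is from Tuesday, and farther from Wednesday, which is the adjacency the raw values 1 through 5 hide.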

 


The resulting mapping table shows the two new proxy variables that are created by the circle transformation, Day_0 and Day_1:




Numeration Using Binary Representations


Under certain scenarios, a modeling engine finds a variable represented as several individual binary fields more useful than a corresponding non-binary representation. A number of techniques for numerating nominal variables generate some form of binary representation. A one-of-n encoding creates as many proxy variables as there are categories. The mapping table can be easily implemented by appending an identity matrix to the category list:
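As a minimal Python sketch of the same idea, each category below is paired with one row of an identity matrix. The function name `one_of_n` and the installer names are hypothetical.

```python
def one_of_n(categories):
    """Pair the k-th category with row k of an n-by-n identity matrix."""
    n = len(categories)
    return {
        cat: [1 if i == k else 0 for i in range(n)]
        for k, cat in enumerate(categories)
    }

# Hypothetical installer names used only for illustration.
installer_map = one_of_n(["Acme", "Baker", "Crane"])
```

Each category sets exactly one of the n proxy variables to 1, so no artificial ordering is introduced between categories.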





In a similar manner, thermometer encodings also generate as many proxy variables as categories:
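A thermometer encoding can be sketched in Python as a lower-triangular matrix of ones: the k-th category switches on the first k proxy variables, like mercury rising in a thermometer. The function name `thermometer` and the sample categories are assumptions, and the exact layout in the article's Mathcad worksheet may differ.

```python
def thermometer(categories):
    """Give the k-th category (counting from 1) k leading ones, then zeros."""
    n = len(categories)
    return {
        cat: [1] * (k + 1) + [0] * (n - k - 1)
        for k, cat in enumerate(categories)
    }

# Hypothetical ordered categories used only for illustration.
level_map = thermometer(["Low", "Medium", "High"])
```

Because each row contains the previous row's ones plus one more, this layout preserves the order of the categories, which makes it a natural fit for ordinal variables.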




Both one-of-n and thermometer encodings are rather expensive in terms of the number of proxy variables they create. Use them with caution, or only when the category count of a given nominal variable is low.


A far more efficient encoding can be performed using Gray codes. This workhorse transformation has wide applicability across several problem domains. Gray codes possess the interesting property of varying by only one bit between sequential values. For example, as you move from row to row in the following matrix, note that one and only one bit changes in each row.



The following function creates a mapping table associating a unique Gray code with each categorical value:

(See the Mathcad file for the full-size program.)


The resulting mapping generates far fewer proxy variables than the one-of-n or thermometer encodings.
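As an illustration of the savings, the Python sketch below builds a Gray code mapping table using the standard binary-reflected construction, `k ^ (k >> 1)`. This construction is an assumption: the article's Mathcad program may generate its codes differently. The function name `gray_code_table` and the installer names are hypothetical.

```python
def gray_code_table(categories):
    """Assign each category a binary-reflected Gray code as a list of bits."""
    n = len(categories)
    # Smallest number of bits that can distinguish n categories (at least 1).
    width = max(1, (n - 1).bit_length())
    table = {}
    for k, cat in enumerate(categories):
        g = k ^ (k >> 1)  # binary-reflected Gray code of k
        table[cat] = [(g >> bit) & 1 for bit in reversed(range(width))]
    return table

# Hypothetical installer names used only for illustration.
installers = ["Acme", "Baker", "Crane", "Dunn", "Evans"]
gray_map = gray_code_table(installers)

# Count the bits that differ between each pair of consecutive categories.
codes = [gray_map[name] for name in installers]
bit_changes = [
    sum(a != b for a, b in zip(codes[i], codes[i + 1]))
    for i in range(len(codes) - 1)
]
```

Five categories need only three proxy variables here, compared with five for a one-of-n or thermometer encoding, and every entry of `bit_changes` is 1.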




Now that you have created mapping tables associating your raw categorical variables with transformed proxy variables, you need to assemble a final modeling data set that incorporates these proxy variables:


Run the mapProxy function once for each transformed variable. The proxy variables for Installer and Day appear in the leftmost columns of the final dataset, replacing their corresponding original variables, followed by the remaining, non-categorical variables.
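The mapProxy program itself appears only in the Mathcad file. The Python sketch below shows one plausible reading of the step described above: replace a column with its proxy columns and move them to the left. The function name `map_proxy`, its signature, the list-of-lists layout, and all sample values are assumptions.

```python
def map_proxy(rows, header, column, mapping):
    """Replace `column` with its proxy columns, placed leftmost.

    `mapping` maps each raw category value to a sequence of proxy values.
    Returns the new rows and the new header.
    """
    idx = header.index(column)
    width = len(next(iter(mapping.values())))
    proxy_names = [f"{column}_{i}" for i in range(width)]
    new_header = proxy_names + header[:idx] + header[idx + 1:]
    new_rows = [
        list(mapping[row[idx]]) + row[:idx] + row[idx + 1:]
        for row in rows
    ]
    return new_rows, new_header

# Hypothetical data and mapping tables used only for illustration.
header = ["Day", "Installer", "Failed"]
rows = [[1, "Acme", 0], [5, "Baker", 1]]
day_map = {1: (0.0, 1.0), 5: (-0.95, 0.31)}
installer_map = {"Acme": [0, 0], "Baker": [0, 1]}

# One call per transformed variable.
rows, header = map_proxy(rows, header, "Day", day_map)
rows, header = map_proxy(rows, header, "Installer", installer_map)
```

After both calls the header reads Installer_0, Installer_1, Day_0, Day_1, Failed, so the proxy variables lead and the untouched variables follow.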




The transformations discussed in this article help you expose the information content of categorical variables in an effective manner for statistical modeling. July's article explores time series data and presents some techniques for handling this special type of data.


The accompanying Mathcad file is in Mathcad 11 format.
