The data consists of 8 feature columns. The data is available here.
- Develop the algorithm for the Principal Component Analysis (PCA) method and implement it programmatically.
- Conduct an analysis of experimental data using the Principal Component Analysis method.
- Load the data according to your variant. Display the data on the monitor as a table.
- Normalize (standardize) the original experimental data. Build a correlation matrix.
- Ensure that the correlation matrix differs significantly from the identity matrix.
- Calculate the projections of objects onto the principal components.
- Analyze the results of the Principal Component Analysis method.
- Check the equality of the sums of sample variances of the original features and the sample variances of projections onto the principal components.
- Determine the relative proportion of variance attributable to the principal components. Build a covariance matrix for projections onto the principal components.
- Based on the first M = 2 principal components, construct a scatter plot. Provide a meaningful interpretation of the first two principal components.
- The data were loaded from a .txt file.
- Exploratory Data Analysis (EDA) was conducted on the obtained data. Descriptive statistics were displayed, distribution histograms and boxplot graphs were constructed.
- The data were normalized using StandardScaler. Distribution histograms and boxplots were also created for the normalized data.
- For the normalized data, Pearson and Kendall correlation matrices were constructed and displayed, along with a covariance matrix, which closely matches the Pearson correlation matrix because the data are standardized.
- The value of d was calculated from the covariance matrix. From the theory: if the correlation matrix of the original data does not differ significantly from the identity matrix (i.e., $d \leq \chi^2$, where $\chi^2$ is the critical value calculated at a given confidence level and degrees of freedom), then applying the Principal Component Analysis method is not advisable.
- Eigenvalues and eigenvectors were obtained using np.linalg.eig. The eigenvectors were used to project the original data onto the principal components, resulting in the Z matrix.
- The sums of the sample variances of the original features and of the projections onto the principal components were calculated; they closely matched, which is consistent with a correct implementation of the PCA method.
- The covariance matrix for Z was displayed.
- The relative proportion of variance attributable to each principal component and the cumulative proportion attributable to the first i components were calculated.
- A scatter plot for the first two principal components was constructed.
- The results were compared with those of the PCA implementation in sklearn, and the same patterns were observed.
- Another dimensionality reduction method, t-SNE, was applied, resulting in improved outcomes.
- Yet another dimensionality reduction method, UMAP, demonstrated the best results.
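
The PCA steps listed above (standardization, correlation matrix, eigendecomposition, projection, and the variance checks) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used in this work: the helper name `pca_by_eig` is hypothetical, the input is a synthetic 8-column stand-in for the real .txt data, standardization uses the sample standard deviation (`ddof=1`) rather than StandardScaler, and `np.linalg.eigh` is swapped in for `np.linalg.eig` because the correlation matrix is symmetric.

```python
import numpy as np


def pca_by_eig(X):
    """Hypothetical helper: PCA through eigendecomposition of the correlation matrix."""
    X = np.asarray(X, dtype=float)
    # Standardize each column: zero mean, unit sample variance (ddof=1).
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # For data standardized this way, the sample covariance matrix
    # is exactly the Pearson correlation matrix.
    R = np.cov(X_std, rowvar=False)
    # eigh is intended for symmetric matrices and returns eigenvalues
    # in ascending order, so reorder them from largest to smallest.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Project the standardized objects onto the principal components.
    Z = X_std @ eigvecs
    return X_std, R, eigvals, eigvecs, Z


# Synthetic stand-in data (assumption): 200 objects, 8 correlated features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + 0.3 * rng.normal(size=(200, 8))

X_std, R, eigvals, eigvecs, Z = pca_by_eig(X)

# Sum of sample variances before and after projection (should agree).
var_original = X_std.var(axis=0, ddof=1).sum()
var_projected = Z.var(axis=0, ddof=1).sum()

# Relative and cumulative proportion of variance per component.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)

# Covariance matrix of the projections: diagonal, with eigenvalues on the diagonal.
cov_Z = np.cov(Z, rowvar=False)

print("sum of variances (original, projected):", var_original, var_projected)
print("explained variance ratio:", np.round(explained, 3))
print("cumulative ratio:", np.round(cumulative, 3))
```

The first two columns of `Z` are the coordinates used for the M = 2 scatter plot. With StandardScaler, which divides by the population standard deviation, the covariance matrix differs from the correlation matrix by a factor of n / (n - 1), which is why the two only closely resemble each other in that setup.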