Trade & Ahead¶

Problem Statement¶

Context¶

The stock market has consistently proven to be a good place to invest in and save for the future. There are a lot of compelling reasons to invest in stocks. It can help in fighting inflation, create wealth, and also provides some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.

It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones which exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.

Objective¶

Trade&Ahead is a financial consultancy firm who provide their customers with personalized investment strategies. They have consulted me as a Data Scientist and provided me with data comprising stock price and some financial indicators for a few companies listed under the New York Stock Exchange. My task is to analyze the data, group the stocks based on the attributes provided, and share insights about the characteristics of each group.

Data Dictionary¶

  • Ticker Symbol: An abbreviation used to uniquely identify publicly traded shares of a particular stock on a particular stock market
  • Company: Name of the company
  • GICS Sector: The specific economic sector assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • GICS Sub Industry: The specific sub-industry group assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
  • Current Price: Current stock price in dollars
  • Price Change: Percentage change in the stock price in 13 weeks
  • Volatility: Standard deviation of the stock price over the past 13 weeks
  • ROE: A measure of financial performance calculated by dividing net income by shareholders' equity (shareholders' equity is equal to a company's assets minus its debt)
  • Cash Ratio: The ratio of a company's total reserves of cash and cash equivalents to its total current liabilities
  • Net Cash Flow: The difference between a company's cash inflows and outflows (in dollars)
  • Net Income: Revenues minus expenses, interest, and taxes (in dollars)
  • Earnings Per Share: Company's net profit divided by the number of common shares it has outstanding (in dollars)
  • Estimated Shares Outstanding: Company's stock currently held by all its shareholders
  • P/E Ratio: Ratio of the company's current stock price to the earnings per share
  • P/B Ratio: Ratio of the company's stock price per share by its book value per share (book value of a company is the net difference between that company's total assets and total liabilities)

Importing necessary libraries and data¶

In [ ]:
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 yellowbrick==1.5 -q --user
In [ ]:
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.2 yellowbrick==1.5 -q --user
# !pip install --upgrade -q jinja2

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

In [2]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='darkgrid')

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# to scale the data using z-score
from sklearn.preprocessing import StandardScaler

# to compute distances
from scipy.spatial.distance import cdist, pdist

# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# to suppress warnings
import warnings
warnings.filterwarnings("ignore")

Loading the Dataset¶

In [3]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [4]:
data = pd.read_csv('/content/drive/MyDrive/content/stock_data.csv')

Data Overview¶

  • Observations
  • Sanity checks

Data Shape¶

In [6]:
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns.")
There are 340 rows and 15 columns.

Displaying few rows of the dataset¶

In [7]:
data.sample(n=10, random_state=1)
Out[7]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
102 DVN Devon Energy Corp. Energy Oil & Gas Exploration & Production 32.000000 -15.478079 2.923698 205 70 830000000 -14454000000 -35.55 4.065823e+08 93.089287 1.785616
125 FB Facebook Information Technology Internet Software & Services 104.660004 16.224320 1.320606 8 958 592000000 3669000000 1.31 2.800763e+09 79.893133 5.884467
11 AIV Apartment Investment & Mgmt Real Estate REITs 40.029999 7.578608 1.163334 15 47 21818000 248710000 1.52 1.636250e+08 26.335526 -1.269332
248 PG Procter & Gamble Consumer Staples Personal Products 79.410004 10.660538 0.806056 17 129 160383000 636056000 3.28 4.913916e+08 24.070121 -2.256747
238 OXY Occidental Petroleum Energy Oil & Gas Exploration & Production 67.610001 0.865287 1.589520 32 64 -588000000 -7829000000 -10.23 7.652981e+08 93.089287 3.345102
336 YUM Yum! Brands Inc Consumer Discretionary Restaurants 52.516175 -8.698917 1.478877 142 27 159000000 1293000000 2.97 4.353535e+08 17.682214 -3.838260
112 EQT EQT Corporation Energy Oil & Gas Exploration & Production 52.130001 -21.253771 2.364883 2 201 523803000 85171000 0.56 1.520911e+08 93.089287 9.567952
147 HAL Halliburton Co. Energy Oil & Gas Equipment & Services 34.040001 -5.101751 1.966062 4 189 7786000000 -671000000 -0.79 8.493671e+08 93.089287 17.345857
89 DFS Discover Financial Services Financials Consumer Finance 53.619999 3.653584 1.159897 20 99 2288000000 2297000000 5.14 4.468872e+08 10.431906 -0.375934
173 IVZ Invesco Ltd. Financials Asset Management & Custody Banks 33.480000 7.067477 1.580839 12 67 412000000 968100000 2.26 4.283628e+08 14.814159 4.218620

Checking data types¶

In [8]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Ticker Symbol                 340 non-null    object 
 1   Security                      340 non-null    object 
 2   GICS Sector                   340 non-null    object 
 3   GICS Sub Industry             340 non-null    object 
 4   Current Price                 340 non-null    float64
 5   Price Change                  340 non-null    float64
 6   Volatility                    340 non-null    float64
 7   ROE                           340 non-null    int64  
 8   Cash Ratio                    340 non-null    int64  
 9   Net Cash Flow                 340 non-null    int64  
 10  Net Income                    340 non-null    int64  
 11  Earnings Per Share            340 non-null    float64
 12  Estimated Shares Outstanding  340 non-null    float64
 13  P/E Ratio                     340 non-null    float64
 14  P/B Ratio                     340 non-null    float64
dtypes: float64(7), int64(4), object(4)
memory usage: 40.0+ KB

Observations:

  • There are 4 columns as Object type. These will be changed to categorical.

Checking for duplicates and missing values¶

In [9]:
data.duplicated().sum()
Out[9]:
0

Observations:

  • There are no duplicated values in the Dataset
In [10]:
data.isna().sum()
Out[10]:
0
Ticker Symbol 0
Security 0
GICS Sector 0
GICS Sub Industry 0
Current Price 0
Price Change 0
Volatility 0
ROE 0
Cash Ratio 0
Net Cash Flow 0
Net Income 0
Earnings Per Share 0
Estimated Shares Outstanding 0
P/E Ratio 0
P/B Ratio 0

Observations:

  • There are no null values in the Dataset

Statistical summary of the dataset¶

In [11]:
data.describe(include='all').T
Out[11]:
count unique top freq mean std min 25% 50% 75% max
Ticker Symbol 340 340 AAL 1 NaN NaN NaN NaN NaN NaN NaN
Security 340 340 American Airlines Group 1 NaN NaN NaN NaN NaN NaN NaN
GICS Sector 340 11 Industrials 53 NaN NaN NaN NaN NaN NaN NaN
GICS Sub Industry 340 104 Oil & Gas Exploration & Production 16 NaN NaN NaN NaN NaN NaN NaN
Current Price 340.0 NaN NaN NaN 80.862345 98.055086 4.5 38.555 59.705 92.880001 1274.949951
Price Change 340.0 NaN NaN NaN 4.078194 12.006338 -47.129693 -0.939484 4.819505 10.695493 55.051683
Volatility 340.0 NaN NaN NaN 1.525976 0.591798 0.733163 1.134878 1.385593 1.695549 4.580042
ROE 340.0 NaN NaN NaN 39.597059 96.547538 1.0 9.75 15.0 27.0 917.0
Cash Ratio 340.0 NaN NaN NaN 70.023529 90.421331 0.0 18.0 47.0 99.0 958.0
Net Cash Flow 340.0 NaN NaN NaN 55537620.588235 1946365312.175789 -11208000000.0 -193906500.0 2098000.0 169810750.0 20764000000.0
Net Income 340.0 NaN NaN NaN 1494384602.941176 3940150279.327937 -23528000000.0 352301250.0 707336000.0 1899000000.0 24442000000.0
Earnings Per Share 340.0 NaN NaN NaN 2.776662 6.587779 -61.2 1.5575 2.895 4.62 50.09
Estimated Shares Outstanding 340.0 NaN NaN NaN 577028337.75403 845849595.417695 27672156.86 158848216.1 309675137.8 573117457.325 6159292035.0
P/E Ratio 340.0 NaN NaN NaN 32.612563 44.348731 2.935451 15.044653 20.819876 31.764755 528.039074
P/B Ratio 340.0 NaN NaN NaN -1.718249 13.966912 -76.119077 -4.352056 -1.06717 3.917066 129.064585

Observations:

  • All values are present.

Creating a copy of the Dataset¶

In [12]:
df = data.copy()

Convert all object type columns to category¶

In [13]:
df = df.apply(lambda col: col.astype('category') if col.dtype == 'object' else col)
In [14]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   Ticker Symbol                 340 non-null    category
 1   Security                      340 non-null    category
 2   GICS Sector                   340 non-null    category
 3   GICS Sub Industry             340 non-null    category
 4   Current Price                 340 non-null    float64 
 5   Price Change                  340 non-null    float64 
 6   Volatility                    340 non-null    float64 
 7   ROE                           340 non-null    int64   
 8   Cash Ratio                    340 non-null    int64   
 9   Net Cash Flow                 340 non-null    int64   
 10  Net Income                    340 non-null    int64   
 11  Earnings Per Share            340 non-null    float64 
 12  Estimated Shares Outstanding  340 non-null    float64 
 13  P/E Ratio                     340 non-null    float64 
 14  P/B Ratio                     340 non-null    float64 
dtypes: category(4), float64(7), int64(4)
memory usage: 58.1 KB

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What does the distribution of stock prices look like?
  2. The stocks of which economic sector have seen the maximum price increase on average?
  3. How are the different variables correlated with each other?
  4. Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
  5. P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?

Global Functions¶

In [15]:
def histogram_boxplot(df, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=df, x=feature, ax=ax_box2, showmeans=True, color="lightblue"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=df, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=df, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        df[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        df[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [16]:
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

Univariate Analysis¶

Current Price¶

In [17]:
histogram_boxplot(df, 'Current Price')
No description has been provided for this image

Observations:

  • Current price is skewed to the right with few outliers in it. The ones detected will be kept since it seems to bring value to the data

Price Change¶

In [18]:
histogram_boxplot(df, 'Price Change')
No description has been provided for this image

Observations:

  • Price change follows a normal distribution with some outliers that will be kept as the bring value to the analysis.

Volatility¶

In [19]:
histogram_boxplot(df, 'Volatility')
No description has been provided for this image

Observations:

  • Volatility change follows a normal with slight skeweness to the right. It has some outliers that will be kept as the bring value to the analysis.

ROE¶

In [20]:
histogram_boxplot(df, 'ROE')
No description has been provided for this image

Observations:

  • ROE is fully skewed to the right. there are some outliers not affecting heavily the analysis.

Cash Ratio¶

In [21]:
histogram_boxplot(df, 'Cash Ratio')
No description has been provided for this image

Observations:

  • Cash ratio is skewed to the right. There are some outliers not affecting heavily the analysis.

Net Cash Flow¶

In [22]:
histogram_boxplot(df, 'Net Cash Flow')
No description has been provided for this image

Observations:

  • Net cash shows a normal distribution with some outliers to both sides. Outliers are being kept for the analysis.

Net Income¶

In [23]:
histogram_boxplot(df, 'Net Income')
No description has been provided for this image

Observations:

  • Net income shows a normal distribution with some outliers to both sides. Outliers are being kept for the analysis.

Earnings Per Share¶

In [24]:
histogram_boxplot(df, 'Earnings Per Share')
No description has been provided for this image

Observations:

  • Earnings per share shows a normal distribution with some outliers to both sides. Outliers are being kept for the analysis.

Estimated Shares Outstanding¶

In [25]:
histogram_boxplot(df, 'Estimated Shares Outstanding')
No description has been provided for this image

Observations:

  • Estimated shared is skewed to the right with some outliers to the right that will be used for the analysis.

P/E Ratio¶

In [26]:
histogram_boxplot(df, 'P/E Ratio')
No description has been provided for this image

Observations:

  • P/E ratio change follows a normal distribution with slight skeweness to the right. It has some outliers that will be kept as the bring value to the analysis.

P/B Ratio¶

In [27]:
histogram_boxplot(df, 'P/B Ratio')
No description has been provided for this image

Observations:

  • P/B ratio hows a clear normal distribution with some outliers that will be used in the analysis.

GICS Sector¶

In [35]:
labeled_barplot(df, 'GICS Sector', perc='true')
No description has been provided for this image

Observations:

  • The dataset comes with a good variety of data with the exception of Telecommunication services that there are only a handful.

GICS Sub Industry¶

In [29]:
labeled_barplot(df, 'GICS Sub Industry')
No description has been provided for this image

Observations:

  • GICS comes with a lot of variety and the dataset is well mixed.

Bivariate Analysis¶

Correlation Check¶

In [36]:
# correlation check
# Select only numeric columns for correlation matrix
numeric_df = df.select_dtypes(include=[np.number])

# Calculate the correlation matrix on numeric columns only
corr = numeric_df.corr()

# Heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, cmap='Blues', fmt='.2f')
plt.title('Correlation Matrix of Variables')
plt.show()
No description has been provided for this image

Observations:

  • There are no huge correlation between the different variables in the dataset.

Stocks of which economic sector have seen the maximum price increase on average¶

In [37]:
plt.figure(figsize=(15,8))
sns.barplot(data=df, x='Price Change', y='GICS Sector', ci=False, palette='Blues_r')
plt.show()
No description has been provided for this image

Observations:

  • Health care and consumer staples are the sectors that have seen the most price increase on average. In other hand, Energy has seen a big drop in value on average.

How Does the Average Cash Ratio Vary Across Economic Sectors¶

In [38]:
plt.figure(figsize=(15,8))
sns.barplot(data=df, x='Cash Ratio', y='GICS Sector', ci=False, palette='Blues_r')
plt.show()
No description has been provided for this image

Observations:

  • IT and telecomm services are the 2 sectors that show the largest cash ratio changes on average.

How Does the P/E Ratio Vary, on Average, Across Economic Sectors?¶

In [39]:
plt.figure(figsize=(15,8))
sns.barplot(data=df, x='P/E Ratio', y='GICS Sector', ci=False, palette='Blues_r')
plt.show()
No description has been provided for this image

Observations:

  • Energy, IT and real state services are the sectors that show the P/E ratio changes on average.

How volatility varies, on average, across economic sectors?¶

In [40]:
plt.figure(figsize=(15,8))
sns.barplot(data=df, x='Volatility', y='GICS Sector', ci=False, palette='Blues_r')
plt.show()
No description has been provided for this image

Observations:

  • Energy and Materials have had the largest volatility changes on average.

Data Preprocessing¶

  • Outlier check
In [41]:
plt.figure(figsize=(15, 12))

numeric_columns = df.select_dtypes(include=np.number).columns.tolist()

for i, variable in enumerate(numeric_columns):
    plt.subplot(3, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
No description has been provided for this image

Observations:

  • In the numerival values in the DS, it is interesting to now how a big portion of the earnings per share are very close to 0 and there are many that present negative values.
  • Some others show values close to 0 which is important to consider while doing analysis. P/E Ratio, P/B ratio and net cash flow are examples of this behaviour.

Data Scaling before Clustering¶

Selecting only numerical values

In [42]:
num_cols = df.select_dtypes(include = np.number).columns.tolist()

Scaling process

In [43]:
scaler = StandardScaler()
subset = df[num_cols].copy()
subset_scaled = scaler.fit_transform(subset)

Creating dataframe of the scaled data

In [44]:
subset_scaled_df = pd.DataFrame(subset_scaled, columns = subset.columns)

Checking data

In [45]:
subset_scaled_df.head()
Out[45]:
Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio
0 -0.393341 0.493950 0.272749 0.989601 -0.210698 -0.339355 1.554415 1.309399 0.107863 -0.652487 -0.506653
1 -0.220837 0.355439 1.137045 0.937737 0.077269 -0.002335 0.927628 0.056755 1.250274 -0.311769 -0.504205
2 -0.367195 0.602479 -0.427007 -0.192905 -0.033488 0.454058 0.744371 0.024831 1.098021 -0.391502 0.094941
3 0.133567 0.825696 -0.284802 -0.317379 1.218059 -0.152497 -0.219816 -0.230563 -0.091622 0.947148 0.424333
4 -0.260874 -0.492636 0.296470 -0.265515 2.237018 0.133564 -0.202703 -0.374982 1.978399 3.293307 0.199196

K-means Clustering¶

Elbow Plot¶

In [46]:
k_means_df = subset_scaled_df.copy()
In [47]:
clusters = range(1, 15)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k, random_state=1)
    model.fit(subset_scaled_df)
    prediction = model.predict(k_means_df)
    distortion = (
        sum(np.min(cdist(k_means_df, model.cluster_centers_, "euclidean"), axis=1))
        / k_means_df.shape[0]
    )

    meanDistortions.append(distortion)

    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)

plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
Number of Clusters: 1 	Average Distortion: 2.5425069919221697
Number of Clusters: 2 	Average Distortion: 2.3862098789299604
Number of Clusters: 3 	Average Distortion: 2.33620927590848
Number of Clusters: 4 	Average Distortion: 2.219050563833442
Number of Clusters: 5 	Average Distortion: 2.133404401901685
Number of Clusters: 6 	Average Distortion: 2.081503686093715
Number of Clusters: 7 	Average Distortion: 2.0045413402786814
Number of Clusters: 8 	Average Distortion: 1.9864237824874411
Number of Clusters: 9 	Average Distortion: 1.956222103389025
Number of Clusters: 10 	Average Distortion: 1.9360473996664198
Number of Clusters: 11 	Average Distortion: 1.8615942883461607
Number of Clusters: 12 	Average Distortion: 1.8219574388532505
Number of Clusters: 13 	Average Distortion: 1.7936924742607907
Number of Clusters: 14 	Average Distortion: 1.7567842179093438
No description has been provided for this image
In [48]:
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(k_means_df)  # fit the data to the visualizer
visualizer.show()  # finalize and render figure
No description has been provided for this image
Out[48]:
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>

Observations:

  • Elbow method shows 8 as the best value to use for clustering.

Determine k Using the Silhouette Score¶

In [49]:
sil_score = []
cluster_list = range(2, 15)
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    preds = clusterer.fit_predict((subset_scaled_df))
    score = silhouette_score(k_means_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, the silhouette score is {})".format(n_clusters, score))

plt.plot(cluster_list, sil_score)
plt.show()
For n_clusters = 2, the silhouette score is 0.45335782729503565)
For n_clusters = 3, the silhouette score is 0.40374060030338865)
For n_clusters = 4, the silhouette score is 0.4246430808437099)
For n_clusters = 5, the silhouette score is 0.4381539778147092)
For n_clusters = 6, the silhouette score is 0.40869599703024256)
For n_clusters = 7, the silhouette score is 0.1207450219233897)
For n_clusters = 8, the silhouette score is 0.3693991650696542)
For n_clusters = 9, the silhouette score is 0.35185096182499204)
For n_clusters = 10, the silhouette score is 0.32950073703610283)
For n_clusters = 11, the silhouette score is 0.1486586842527321)
For n_clusters = 12, the silhouette score is 0.15784241071085106)
For n_clusters = 13, the silhouette score is 0.15646997458716602)
For n_clusters = 14, the silhouette score is 0.16253506827999134)
No description has been provided for this image
In [50]:
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 15), metric="silhouette", timings=True)
visualizer.fit(k_means_df)  # fit the data to the visualizer
visualizer.show()  # finalize and render figure
No description has been provided for this image
Out[50]:
<Axes: title={'center': 'Silhouette Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='silhouette score'>

Observations:

  • The highest silhouette score is observed for n_clusters = 5, representing the optimal number of clusters.
In [72]:
visualizer = SilhouetteVisualizer(KMeans(n_clusters=5, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
No description has been provided for this image
Out[72]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
  • An average score around 0.43 is looking good. Still will test with 6 and 8 to review how clusters look like.
In [74]:
visualizer = SilhouetteVisualizer(KMeans(n_clusters=6, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
No description has been provided for this image
Out[74]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [75]:
visualizer = SilhouetteVisualizer(KMeans(n_clusters=8, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
No description has been provided for this image
Out[75]:
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 8 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>

Observations:

  • After reviewing n_clusters in 6, 5 and 8. 5 is confirmed to be the best number of clusters showing high correlation between values within the clusters.
  • The silhoutte score 0.43 is best for k = 5

Creating final model.¶

In [76]:
kmeans = KMeans(n_clusters=5, random_state=1)
kmeans.fit(k_means_df)
Out[76]:
KMeans(n_clusters=5, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KMeans(n_clusters=5, random_state=1)
In [78]:
# Creating a copy of the original data
dfK = df.copy()

# Adding Kmeans cluster labels to the original and scaled dataframes
k_means_df['KM_segments'] = kmeans.labels_
dfK['KM_segments'] = kmeans.labels_
In [79]:
dfK['KM_segments'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 340 entries, 0 to 339
Series name: KM_segments
Non-Null Count  Dtype
--------------  -----
340 non-null    int32
dtypes: int32(1)
memory usage: 1.5 KB
In [80]:
dfK.head(10)
Out[80]:
Ticker Symbol Security GICS Sector GICS Sub Industry Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio KM_segments
0 AAL American Airlines Group Industrials Airlines 42.349998 9.999995 1.687151 135 51 -604000000 7610000000 11.39 6.681299e+08 3.718174 -8.784219 0
1 ABBV AbbVie Health Care Pharmaceuticals 59.240002 8.339433 2.197887 130 77 51000000 5144000000 3.15 1.633016e+09 18.806350 -8.750068 0
2 ABT Abbott Laboratories Health Care Health Care Equipment 44.910000 11.301121 1.273646 21 67 938000000 4423000000 2.94 1.504422e+09 15.275510 -0.394171 0
3 ADBE Adobe Systems Inc Information Technology Application Software 93.940002 13.977195 1.357679 9 180 -240840000 629551000 1.26 4.996437e+08 74.555557 4.199651 0
4 ADI Analog Devices, Inc. Information Technology Semiconductors 55.320000 -1.827858 1.701169 14 272 315120000 696878000 0.31 2.247994e+09 178.451613 1.059810 4
5 ADM Archer-Daniels-Midland Co Consumer Staples Agricultural Products 36.680000 -12.017268 1.516493 10 49 -189000000 1849000000 2.99 6.183946e+08 12.267559 7.496831 0
6 ADS Alliance Data Systems Information Technology Data Processing & Outsourced Services 276.570007 6.189286 1.116976 30 25 90885000 596541000 8.91 6.695185e+07 31.040405 129.064585 4
7 AEE Ameren Corp Utilities MultiUtilities 43.230000 2.174424 1.124186 9 14 287000000 636000000 2.60 2.446154e+08 16.626923 -0.719497 0
8 AEP American Electric Power Utilities Electric Utilities 58.270000 2.371753 1.068485 11 9 13900000 2052300000 3.13 4.218978e+08 18.456543 -3.022649 0
9 AFL AFLAC Inc Financials Life & Health Insurance 59.900002 3.027181 1.048295 14 99 -308000000 2533000000 5.88 4.307823e+08 10.187075 -1.883912 0
In [81]:
dfK.describe()
Out[81]:
Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio KM_segments
count 340.000000 340.000000 340.000000 340.000000 340.000000 3.400000e+02 3.400000e+02 340.000000 3.400000e+02 340.000000 340.000000 340.000000
mean 80.862345 4.078194 1.525976 39.597059 70.023529 5.553762e+07 1.494385e+09 2.776662 5.770283e+08 32.612563 -1.718249 0.476471
std 98.055086 12.006338 0.591798 96.547538 90.421331 1.946365e+09 3.940150e+09 6.587779 8.458496e+08 44.348731 13.966912 1.158287
min 4.500000 -47.129693 0.733163 1.000000 0.000000 -1.120800e+10 -2.352800e+10 -61.200000 2.767216e+07 2.935451 -76.119077 0.000000
25% 38.555000 -0.939484 1.134878 9.750000 18.000000 -1.939065e+08 3.523012e+08 1.557500 1.588482e+08 15.044653 -4.352056 0.000000
50% 59.705000 4.819505 1.385593 15.000000 47.000000 2.098000e+06 7.073360e+08 2.895000 3.096751e+08 20.819876 -1.067170 0.000000
75% 92.880001 10.695493 1.695549 27.000000 99.000000 1.698108e+08 1.899000e+09 4.620000 5.731175e+08 31.764755 3.917066 0.000000
max 1274.949951 55.051683 4.580042 917.000000 958.000000 2.076400e+10 2.444200e+10 50.090000 6.159292e+09 528.039074 129.064585 4.000000

Cluster Profiling¶

In [82]:
km_cluster_profile = dfK.groupby('KM_segments').mean(numeric_only = True)
In [83]:
km_cluster_profile['count_in_each_segment'] = (
    dfK.groupby("KM_segments")["Security"].count().values
)
In [84]:
km_cluster_profile.style.highlight_max(color = 'lightgreen', axis = 0)
Out[84]:
  Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio count_in_each_segment
KM_segments                        
0 72.841355 5.170873 1.381148 35.054348 53.481884 121362.318841 1548856637.681159 3.667808 441678653.226848 23.624755 -3.371505 276
1 38.194074 -15.163792 2.832583 48.148148 48.777778 -104439592.592593 -2427850333.333333 -6.262222 480223773.242222 89.543050 1.347101 27
2 24.485001 -13.351992 3.482611 802.000000 51.000000 -1292500000.000000 -19106500000.000000 -41.815000 519573983.250000 60.748608 1.565141 2
3 46.672222 5.166566 1.079367 25.000000 58.333333 -3040666666.666667 14848444444.444445 3.435556 4564959946.222222 15.596051 -6.354193 9
4 226.489357 13.425086 1.710605 25.346154 273.192308 1985390923.076923 1951357384.615385 5.905385 738326814.835385 72.627508 14.000626 26
In [85]:
for cl in dfK['KM_segments'].unique():
    print('There are total of', dfK[dfK["KM_segments"] == cl]['Security'].nunique(), 'companies in cluster {}, the following companies are present:'.format(cl))
    print(dfK[dfK["KM_segments"] == cl]['Security'].unique())
    print("=" * 100, "\n")
    print()
There are total of 276 companies in cluster 0, the following companies are present:
['American Airlines Group', 'AbbVie', 'Abbott Laboratories', 'Adobe Systems Inc', 'Archer-Daniels-Midland Co', ..., 'Xylem Inc.', 'Yum! Brands Inc', 'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis']
Length: 276
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
==================================================================================================== 


There are total of 26 companies in cluster 4, the following companies are present:
['Analog Devices, Inc.', 'Alliance Data Systems', 'Alexion Pharmaceuticals', 'Amgen Inc', 'Amazon.com Inc', ..., 'TripAdvisor', 'Vertex Pharmaceuticals Inc', 'Waters Corporation', 'Wynn Resorts Ltd', 'Yahoo Inc.']
Length: 26
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
==================================================================================================== 


There are total of 2 companies in cluster 2, the following companies are present:
['Apache Corporation', 'Chesapeake Energy']
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
==================================================================================================== 


There are total of 27 companies in cluster 1, the following companies are present:
['Anadarko Petroleum Corp', 'Baker Hughes Inc', 'Cabot Oil & Gas', 'Concho Resources', 'Devon Energy Corp.', ..., 'Spectra Energy Corp.', 'Southwestern Energy', 'Teradata Corp.', 'Williams Cos.', 'Cimarex Energy']
Length: 27
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
==================================================================================================== 


There are total of 9 companies in cluster 3, the following companies are present:
['Citigroup Inc.', 'Ford Motor', 'JPMorgan Chase & Co.', 'Coca Cola Company', 'Pfizer Inc.', 'AT&T Inc', 'Verizon Communications', 'Wells Fargo', 'Exxon Mobil Corp.']
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
==================================================================================================== 


In [86]:
dfK.groupby(["KM_segments", "GICS Sector"])["Security"].count()
Out[86]:
Security
KM_segments GICS Sector
0 Consumer Discretionary 33
Consumer Staples 17
Energy 5
Financials 45
Health Care 30
Industrials 52
Information Technology 24
Materials 18
Real Estate 26
Telecommunications Services 2
Utilities 24
1 Consumer Discretionary 0
Consumer Staples 0
Energy 21
Financials 0
Health Care 0
Industrials 1
Information Technology 3
Materials 2
Real Estate 0
Telecommunications Services 0
Utilities 0
2 Consumer Discretionary 0
Consumer Staples 0
Energy 2
Financials 0
Health Care 0
Industrials 0
Information Technology 0
Materials 0
Real Estate 0
Telecommunications Services 0
Utilities 0
3 Consumer Discretionary 1
Consumer Staples 1
Energy 1
Financials 3
Health Care 1
Industrials 0
Information Technology 0
Materials 0
Real Estate 0
Telecommunications Services 2
Utilities 0
4 Consumer Discretionary 6
Consumer Staples 1
Energy 1
Financials 1
Health Care 9
Industrials 0
Information Technology 6
Materials 0
Real Estate 1
Telecommunications Services 1
Utilities 0

In [87]:
plt.figure(figsize = (20,20))
plt.suptitle("Boxplot of numerical variables for each cluster")

# To select numerical columns
num_col = df.select_dtypes(include = np.number).columns.tolist()
for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data = dfK, x = "KM_segments", y = variable, palette = 'husl', flierprops = dict(markerfacecolor='lightcoral', marker = 'D'))

plt.tight_layout(pad = 2.0)
No description has been provided for this image
In [88]:
# Generate a color map with a unique color for each bar
num_of_bars = len(dfK.columns) - 1  # Subtract 1 for the GROUP column
cm = plt.get_cmap('tab10')
colors = [cm(1. * i / num_of_bars) for i in range(num_of_bars)]
In [89]:
dfK.groupby("KM_segments").mean(numeric_only = True).plot.bar(figsize = (30,15), color = colors)
Out[89]:
<Axes: xlabel='KM_segments'>
No description has been provided for this image

Insights¶

  • Cluster 0
    • There are a total of 276 companies in cluster 0
    • The stocks in this cluster have the following detailes:
    • Negative net cash flow
    • Moderate Estimated shares outstanding
    • No other metrics associated.
  • Cluster 1
    • There are 27 companies in cluster 1
    • The stocks in this cluster have the following detailes:
    • Low net cash flow
    • Negative net income
    • Low Estimated shares outstanding
  • Cluster 2
    • There are 2 companies in cluster 2
    • The stocks in this cluster have the following detailes:
    • Negative net cash flow
    • Heavy negative net income
    • Low Estimated shares outstanding
  • Cluster 3
    • There are 9 companies in cluster 3
    • The stocks in this cluster have the following detailes:
    • Negative net cash flow
    • really high net income
    • Moderate Estimated shares outstanding
  • Cluster 4
    • There are 26 companies in cluster 4
    • The stocks in this cluster have the following detailes:
    • Low net cash flow
    • Low high net income
    • Low Estimated shares outstanding

Hierarchical Clustering¶

Cophenetic Correlation¶

In [90]:
hc_df = subset_scaled_df.copy()
In [91]:
# List of distance metrics
distance_metrics = ['euclidean', 'chebyshev', 'mahalanobis', 'cityblock']

# List of linkage methods
linkage_methods = ['single', 'complete', 'average', 'weighted']

high_cophenet_corr = 0
high_dm_lm = [0,0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(hc_df, metric = dm, method = lm)
        c, coph_dists = cophenet(Z, pdist(hc_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(dm.capitalize(), lm, c)
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm


# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('***'*50)

print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage".format(high_cophenet_corr,
                                                                                                     high_dm_lm[0].capitalize(),
                                                                                                     high_dm_lm[1])
)
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672.
Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404.
Cophenetic correlation for Chebyshev distance and single linkage is 0.9062538164750717.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.598891419111242.
Cophenetic correlation for Chebyshev distance and average linkage is 0.9338265528030499.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127355892367.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.925919553052459.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.7925307202850002.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247324030159736.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708317490180428.
Cophenetic correlation for Cityblock distance and single linkage is 0.9334186366528574.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7375328863205818.
Cophenetic correlation for Cityblock distance and average linkage is 0.9302145048594667.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.731045513520281.
******************************************************************************************************************************************************
Highest cophenetic correlation is 0.9422540609560814, which is obtained with Euclidean distance and average linkage

Explore different linkage methods with Euclidean distance only.

In [92]:
# list of linkage methods
linkage_methods = ['single', 'complete', 'average', 'centroid', 'ward', 'weighted']

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for lm in linkage_methods:
    Z = linkage(hc_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(hc_df))
    print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm

# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
    "Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
        high_cophenet_corr, high_dm_lm[1]
    )
)
Cophenetic correlation for single linkage is 0.9232271494002922.
Cophenetic correlation for complete linkage is 0.7873280186580672.
Cophenetic correlation for average linkage is 0.9422540609560814.
Cophenetic correlation for centroid linkage is 0.9314012446828154.
Cophenetic correlation for ward linkage is 0.7101180299865353.
Cophenetic correlation for weighted linkage is 0.8693784298129404.
****************************************************************************************************
Highest cophenetic correlation is 0.9422540609560814, which is obtained with average linkage.

Observations:

  • The maximum Cophenetic Correlation is Euclidean distance and average linkage

Checking Dendograms¶

In [94]:
# list of linkage methods
linkage_methods = ['single', 'complete', 'average', 'centroid', 'ward', 'weighted']

# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []

# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(hc_df, metric="euclidean", method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(hc_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
    compare.append([method, coph_corr])
No description has been provided for this image
In [95]:
# create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc = pd.DataFrame(compare, columns = compare_cols)
df_cc = df_cc.sort_values(by="Cophenetic Coefficient")
df_cc
Out[95]:
Linkage Cophenetic Coefficient
4 ward 0.710118
1 complete 0.787328
5 weighted 0.869378
0 single 0.923227
3 centroid 0.931401
2 average 0.942254

Observations

  • The cophenetic correlation is highest for average and centroid linkage methods, but the dendrogram for average appears to provide better clusters
  • 5 appears to be the appropriate number of clusters for the average linkage method

Creating model using sklearn with Euclidean and Average linkage¶

In [113]:
HCmodel = AgglomerativeClustering(n_clusters = 5,  metric = 'euclidean', linkage = 'average')
HCmodel.fit(hc_df)
Out[113]:
AgglomerativeClustering(linkage='average', n_clusters=5)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AgglomerativeClustering(linkage='average', n_clusters=5)
In [114]:
# creating a copy of the original data
df2 = df.copy()

# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_segments"] = HCmodel.labels_
df2["HC_segments"] = HCmodel.labels_

Cluster Profiling for Euclidean - Average¶

In [115]:
hc_cluster_profile = df2.groupby("HC_segments").mean(numeric_only = True)
In [116]:
hc_cluster_profile['count_in_each_segment'] = (
    df2.groupby("HC_segments")["Security"].count().values
)
In [117]:
hc_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
Out[117]:
  Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio count_in_each_segment
HC_segments                        
0 77.884243 4.105986 1.516865 35.320359 66.775449 -32825817.365269 1535255703.592814 2.903308 559027333.145509 32.437511 -1.781988 334
1 25.640000 11.237908 1.322355 12.500000 130.500000 16755500000.000000 13654000000.000000 3.295000 2791829362.100000 13.649696 1.508484 2
2 24.485001 -13.351992 3.482611 802.000000 51.000000 -1292500000.000000 -19106500000.000000 -41.815000 519573983.250000 60.748608 1.565141 2
3 104.660004 16.224320 1.320606 8.000000 958.000000 592000000.000000 3669000000.000000 1.310000 2800763359.000000 79.893133 5.884467 1
4 1274.949951 3.190527 1.268340 29.000000 184.000000 -1671386000.000000 2551360000.000000 50.090000 50935516.070000 25.453183 -1.052429 1
In [118]:
## Complete the code to print the companies in each cluster
for cl in df2["HC_segments"].unique():
    print("In cluster {}, the following companies are present:".format(cl))
    print(df2[df2["HC_segments"] == cl]["Security"].unique())
    print("=" * 100, "\n")
    print()
In cluster 0, the following companies are present:
['American Airlines Group', 'AbbVie', 'Abbott Laboratories', 'Adobe Systems Inc', 'Analog Devices, Inc.', ..., 'Yahoo Inc.', 'Yum! Brands Inc', 'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis']
Length: 334
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
==================================================================================================== 


In cluster 2, the following companies are present:
['Apache Corporation', 'Chesapeake Energy']
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
==================================================================================================== 


In cluster 1, the following companies are present:
['Bank of America Corp', 'Intel Corp.']
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
==================================================================================================== 


In cluster 3, the following companies are present:
['Facebook']
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
==================================================================================================== 


In cluster 4, the following companies are present:
['Priceline.com Inc']
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
==================================================================================================== 


In [106]:
df2.groupby(["HC_segments", "GICS Sector"])["Security"].count()
Out[106]:
Security
HC_segments GICS Sector
0 Consumer Discretionary 39
Consumer Staples 19
Energy 28
Financials 48
Health Care 40
Industrials 53
Information Technology 31
Materials 20
Real Estate 27
Telecommunications Services 5
Utilities 24
1 Consumer Discretionary 0
Consumer Staples 0
Energy 0
Financials 1
Health Care 0
Industrials 0
Information Technology 1
Materials 0
Real Estate 0
Telecommunications Services 0
Utilities 0
2 Consumer Discretionary 0
Consumer Staples 0
Energy 2
Financials 0
Health Care 0
Industrials 0
Information Technology 0
Materials 0
Real Estate 0
Telecommunications Services 0
Utilities 0
3 Consumer Discretionary 0
Consumer Staples 0
Energy 0
Financials 0
Health Care 0
Industrials 0
Information Technology 1
Materials 0
Real Estate 0
Telecommunications Services 0
Utilities 0
4 Consumer Discretionary 1
Consumer Staples 0
Energy 0
Financials 0
Health Care 0
Industrials 0
Information Technology 0
Materials 0
Real Estate 0
Telecommunications Services 0
Utilities 0

In [ ]:
plt.figure(figsize = (20,20))
plt.suptitle("Boxplot of numerical variables for each cluster")

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i+1)
    sns.boxplot(data = df2, x = "HC_segments", y = variable, palette = 'husl', flierprops = dict(markerfacecolor = 'lightcoral', marker = 'D',))

plt.tight_layout(pad = 2.0)
No description has been provided for this image

Insights

  • Cluster 0: There are 334 companies in cluster 0.
  • Cluster 1: There are only 2 companies in Cluster 1.
  • Cluster 2: There are only 2 companies in Cluster 2.
  • Cluster 3: There are only 1 companies in Cluster 3.
  • Cluster 4: There are only 1 companies in Cluster 4.

- The clustering does not look good since clusters don't have enough variability.

- Ward linkage to be used and test if works better. Taken from dendogram.

- Using 5 clusters for ward linkage.

Creating Model using Euclidean distance and Ward linkage¶

In [130]:
HCmodel = AgglomerativeClustering(n_clusters = 5, metric = 'euclidean', linkage = 'ward')
HCmodel.fit(hc_df)
Out[130]:
AgglomerativeClustering(n_clusters=5)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AgglomerativeClustering(n_clusters=5)
In [131]:
# Creating a copy of the original data for Euclidean Ward
dfHCW = df.copy()

# Adding hierarchical cluster labels to the original and scaled datagrames
hc_df['HC_segments'] = HCmodel.labels_
dfHCW['HC_segments'] = HCmodel.labels_

Cluster Profiling for Euclidean - Ward¶

In [132]:
hc_cluster_profile = dfHCW.groupby("HC_segments").mean(numeric_only = True)
In [133]:
hc_cluster_profile['count_in_each_segment'] = (dfHCW.groupby("HC_segments")["Security"].count().values)
In [134]:
hc_cluster_profile.style.highlight_max(color='lightgreen', axis = 0)
Out[134]:
  Current Price Price Change Volatility ROE Cash Ratio Net Cash Flow Net Income Earnings Per Share Estimated Shares Outstanding P/E Ratio P/B Ratio count_in_each_segment
HC_segments                        
0 326.198218 10.563242 1.642560 14.400000 309.466667 288850666.666667 864498533.333333 7.785333 544900261.301333 113.095334 19.142151 15
1 72.432035 5.255290 1.426349 25.512195 60.881533 196157425.087108 1623022236.933798 3.652840 462816085.049861 24.645506 -2.618234 287
2 46.672222 5.166566 1.079367 25.000000 58.333333 -3040666666.666667 14848444444.444445 3.435556 4564959946.222222 15.596051 -6.354193 9
3 84.355716 3.854981 1.827670 633.571429 33.571429 -568400000.000000 -4968157142.857142 -10.841429 398169036.442857 42.284541 -11.589502 7
4 36.440455 -16.073408 2.832884 57.500000 42.409091 -472834090.909091 -3161045227.272727 -8.005000 514367806.201818 85.555682 0.836839 22
In [135]:
# List the names of companies in each cluster
for cl in dfHCW['HC_segments'].unique():
    print('There are total of', dfHCW[dfHCW['HC_segments'] == cl]['Security'].nunique(), 'companies in cluster {}, the following companies are present:'.format(cl))
    print(dfHCW[dfHCW['HC_segments'] == cl]['Security'].unique())
    print('*' * 100, '\n')
    print()
There are total of 287 companies in cluster 1, the following companies are present:
['American Airlines Group', 'AbbVie', 'Abbott Laboratories', 'Adobe Systems Inc', 'Analog Devices, Inc.', ..., 'Xylem Inc.', 'Yum! Brands Inc', 'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis']
Length: 287
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
**************************************************************************************************** 


There are total of 15 companies in cluster 0, the following companies are present:
['Alliance Data Systems', 'Alexion Pharmaceuticals', 'Amgen Inc', 'Amazon.com Inc', 'Chipotle Mexican Grill', ..., 'Netflix Inc.', 'Priceline.com Inc', 'Regeneron', 'Waters Corporation', 'Yahoo Inc.']
Length: 15
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
**************************************************************************************************** 


There are total of 7 companies in cluster 3, the following companies are present:
['Allegion', 'Apache Corporation', 'Chesapeake Energy', 'Charter Communications', 'Colgate-Palmolive', 'Kimberly-Clark', 'S&P Global, Inc.']
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
**************************************************************************************************** 


There are total of 22 companies in cluster 4, the following companies are present:
['Anadarko Petroleum Corp', 'Baker Hughes Inc', 'Cabot Oil & Gas', 'Concho Resources', 'Devon Energy Corp.', ..., 'Spectra Energy Corp.', 'Southwestern Energy', 'Teradata Corp.', 'Williams Cos.', 'Cimarex Energy']
Length: 22
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
**************************************************************************************************** 


There are total of 9 companies in cluster 2, the following companies are present:
['Citigroup Inc.', 'Ford Motor', 'JPMorgan Chase & Co.', 'Coca Cola Company', 'Pfizer Inc.', 'AT&T Inc', 'Verizon Communications', 'Wells Fargo', 'Exxon Mobil Corp.']
Categories (340, object): ['3M Company', 'AFLAC Inc', 'AMETEK Inc', 'AT&T Inc', ...,
                           'Zimmer Biomet Holdings', 'Zions Bancorp', 'Zoetis', 'eBay Inc.']
**************************************************************************************************** 


In [136]:
dfHCW.groupby(['HC_segments', 'GICS Sector'])['Security'].count()
Out[136]:
Security
HC_segments GICS Sector
0 Consumer Discretionary 3
Consumer Staples 1
Energy 0
Financials 0
Health Care 5
Industrials 0
Information Technology 4
Materials 0
Real Estate 1
Telecommunications Services 1
Utilities 0
1 Consumer Discretionary 35
Consumer Staples 15
Energy 7
Financials 45
Health Care 34
Industrials 52
Information Technology 28
Materials 19
Real Estate 26
Telecommunications Services 2
Utilities 24
2 Consumer Discretionary 1
Consumer Staples 1
Energy 1
Financials 3
Health Care 1
Industrials 0
Information Technology 0
Materials 0
Real Estate 0
Telecommunications Services 2
Utilities 0
3 Consumer Discretionary 1
Consumer Staples 2
Energy 2
Financials 1
Health Care 0
Industrials 1
Information Technology 0
Materials 0
Real Estate 0
Telecommunications Services 0
Utilities 0
4 Consumer Discretionary 0
Consumer Staples 0
Energy 20
Financials 0
Health Care 0
Industrials 0
Information Technology 1
Materials 1
Real Estate 0
Telecommunications Services 0
Utilities 0

In [137]:
plt.figure(figsize = (20,20))
plt.suptitle("Boxplot of numerical variables for each cluster with Euclidean Ward")

for i, variable in enumerate(num_col):
    plt.subplot(3, 4, i+1)
    sns.boxplot(data = dfHCW, x = 'HC_segments', y = variable, palette = 'husl', flierprops = dict(markerfacecolor = 'lightcoral', marker = 'D',))

plt.tight_layout(pad = 2.0)
No description has been provided for this image
In [138]:
# Generate a color map with a unique color for each bar
num_of_bars = len(dfHCW.columns) - 1  # Subtract 1 for the GROUP column
cm = plt.get_cmap('tab10')
colors = [cm(1. * i / num_of_bars) for i in range(num_of_bars)]
In [139]:
dfHCW.groupby("HC_segments").mean(numeric_only = True).plot.bar(figsize = (30,15), color = colors)
Out[139]:
<Axes: xlabel='HC_segments'>
No description has been provided for this image

Insights

  • Cluster 0
    • There are 15 companies in Cluster 0.
    • The stocks in this cluster have the following detailes:
      • Very low net cash flow
      • Low net income
      • Low Estimated shares outstanding
  • Cluster 1
    • There are 287 companies in Cluster 1.
    • The stocks in this cluster have the following detailes:
      • Very Low net cash flow
      • Moderate net income
      • Low Estimated shares outstanding
  • Cluster 2
    • There are 9 companies in Cluster 2.
    • The stocks in this cluster have the following detailes:
      • Negative net cash flow
      • Very High net income
      • High Estimated shares outstanding
  • Cluster 3
    • There are 7 companies in Cluster 3.
    • The stocks in this cluster have the following detailes:
      • Negative Low net cash flow
      • High negative net income
      • Very low Estimated shares outstanding
  • Cluster 4
    • There are 22 companies in Cluster 4.
    • The stocks in this cluster have the following detailes:
      • Negative Low net cash flow
      • Moderate negative net income
      • Very low Estimated shares outstanding

K-means vs Hierarchical Clustering¶

  • Hierarchical and K-means had almost the same performance in execution, no major differences.
  • Both techniques showed two similar clusters and others dissimilar. Ward linkage in Hierarchical showed a better clustering result.
  • The groupings using Hierarchical is better than K-means in terms of distribution of companies.
  • Both algorithms yielded similar clusters based on the outliers using the variables provided.

Actionable Insights and Recommendations¶

Recommendations:

  • Algorithm Refinement: The difference in cluster suggestions from K-Means and Hierarchical clustering methods indicates a need for algorithm refinement.
  • More variables are required to have a better analysis since both methods have one particular cluster with many companies and some others with just a few showing that there's not very good distribution of portfolios.
  • As of now, the tool doesn't allow to recommend specific sectors to invest, more analysis needs to be done on this end.

Insights

  • Trade&Ahead needs to consider their clients' profile and how they want to invest, before suggesting any of the clusters (portfolios) aligned by this tool since in almost all there are risks associated.
  • Depending on the client's profile the tool does provide portfolios whith details to take decisions such as net cash and overall shares outstanding.