<h2 id="dbscan-algorithm-from-scratch-in-python">DBSCAN Algorithm from Scratch in Python</h2>
<p>DBSCAN (<strong>D</strong>ensity-<strong>B</strong>ased <strong>S</strong>patial <strong>C</strong>lustering of <strong>A</strong>pplications with <strong>N</strong>oise) is a popular <strong>unsupervised</strong> learning method used in model building and machine learning, originally proposed by <a href="https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf">Ester et al.</a> in 1996. Before we go any further, we need to define what an “unsupervised” learning method is. <strong>Unsupervised</strong> learning methods are used when there is no clear target or outcome we are trying to predict; instead, we cluster the data based on the similarity of observations.</p>
<h2 id="definitions">Definitions</h2>
<ul>
<li>
<p>ε (epsilon): the radius of a neighborhood centered on a given point</p>
</li>
<li>
<p>Core Point: a given point is considered a Core Point if there are at least <em>minPts</em> points within its ε neighborhood, including itself</p>
</li>
<li>
<p>Border Point: a given point is considered a Border Point if there are fewer than <em>minPts</em> points within its ε neighborhood, including itself, but it lies within the ε neighborhood of some Core Point</p>
</li>
<li>
<p>Noise: any point that is not a Core Point or Border Point</p>
</li>
<li>
<p>Directly Density Reachable: a given point is Directly Density Reachable (ε Reachable) from another point if the second point is a core point, and the first point lies within the ε neighborhood of the second point</p>
</li>
<li>
<p>Density Reachable: a given point is Density Reachable from another point if there is a chain of points, Directly Density Reachable from each other, that connects them</p>
</li>
<li>
<p>Density Connected: a given point is Density Connected to another point if there is a third point from which both are Density Reachable — These points are said to be Connected Components</p>
</li>
</ul>
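<p>To make the definitions concrete, here is a small sketch (illustrative only, separate from the full algorithm code later in this post) that labels each point of a toy dataset as a Core Point, Border Point, or Noise:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def classify(points, eps, min_pts):
    '''Label each point 'core', 'border', or 'noise' per the definitions above.'''
    def neighbors(i):
        # The eps neighborhood of point i, including the point itself.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    is_core = [len(neighbors(i)) >= min_pts for i in range(len(points))]

    kinds = []
    for i in range(len(points)):
        if is_core[i]:
            kinds.append('core')
        elif any(is_core[j] for j in neighbors(i)):
            kinds.append('border')   # within eps of a Core Point
        else:
            kinds.append('noise')    # neither core nor near a core
    return kinds

# Four tightly packed points plus one far-away outlier.
points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5), (3.0, 3.0)]
print(classify(points, eps=1.0, min_pts=4))
# ['core', 'core', 'core', 'core', 'noise']
</code></pre></div></div>

<p>With <em>minPts</em> = 4 and ε = 1.0, each of the four clustered points has four neighbors (itself included) and so is a Core Point; the outlier has only itself and no nearby Core Point, so it is Noise.</p>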
<h2 id="dbscan-in-a-nutshell">DBSCAN in a Nutshell</h2>
<p>Given a set of points <em>P</em>, the radius of a neighborhood ε, and a minimum number of points <em>minPts:</em></p>
<ol>
<li>
<p>Find all points within the ε neighborhood of each point;</p>
</li>
<li>
<p>Identify Core Points with at least <em>minPts</em> neighbors;</p>
</li>
<li>
<p>Find all Connected Components of each core point — This Density Connected grouping of points is a cluster</p>
</li>
<li>
<p>Each Border Point is assigned to a cluster if that cluster is Density Reachable from it; otherwise, the Border Point is considered Noise</p>
</li>
</ol>
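<p>The four steps above can be sketched directly, using a union-find over Core Points for the connected-components step. This is an illustrative alternative, not the breadth-first-search implementation given later in this post:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def dbscan_outline(P, eps, min_pts):
    n = len(P)
    # Step 1: the eps neighborhood of every point (including the point itself).
    nbrs = [[j for j in range(n) if math.dist(P[i], P[j]) <= eps]
            for i in range(n)]
    # Step 2: Core Points have at least min_pts neighbors.
    core = {i for i in range(n) if len(nbrs[i]) >= min_pts}
    # Step 3: connected components of Core Points via union-find.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in core:
        for j in nbrs[i]:
            if j in core:
                parent[find(j)] = find(i)
    # Number each component; every Core Point gets its component's cluster ID.
    labels, cluster_id = {}, {}
    for i in range(n):
        if i in core:
            root = find(i)
            cluster_id.setdefault(root, len(cluster_id) + 1)
            labels[i] = cluster_id[root]
    # Step 4: Border Points join a reachable cluster; everything else is Noise.
    for i in range(n):
        if i not in labels:
            core_nbrs = [j for j in nbrs[i] if j in core]
            labels[i] = labels[core_nbrs[0]] if core_nbrs else -1
    return [labels[i] for i in range(n)]

# Two well-separated blobs plus one outlier.
data = [(0, 0), (0, 0.4), (0.4, 0), (0.4, 0.4),
        (5, 5), (5, 5.4), (5.4, 5), (5.4, 5.4),
        (10, 10)]
print(dbscan_outline(data, eps=1.0, min_pts=4))
# [1, 1, 1, 1, 2, 2, 2, 2, -1]
</code></pre></div></div>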
<p>Any given point may initially be considered Noise and later revised to belong to a cluster, but once assigned to a cluster a point will never be reassigned to a different cluster.</p>
<p>Okay, if that was too much, let’s explain it in some simpler terms:</p>
<p>Given that <strong>DBSCAN</strong> is a <strong>density-based clustering algorithm</strong>, it does a great job of finding areas of the data with a high density of observations versus areas that are sparse. DBSCAN can also sort data into clusters of varying shapes, another strong advantage. DBSCAN works as follows:</p>
<ul>
<li>
<p>Divides the dataset into <em>n</em> dimensions</p>
</li>
<li>
<p>For each point in the dataset, DBSCAN forms an <em>n-dimensional</em> shape around that data point and then counts how many data points fall within that shape.</p>
</li>
<li>
<p>DBSCAN counts this shape as a <em>cluster</em>. DBSCAN iteratively expands the cluster by going through each individual point within the cluster and counting the number of other data points nearby. Take the graphic below for an example:</p>
</li>
</ul>
<p><img src="https://cdn-images-1.medium.com/max/3190/1*t4QjgJ0JfDq9_VNGDNcj5w.png" alt="" /></p>
<p>Going through the process step-by-step, DBSCAN will start by dividing the data into <em>n</em> dimensions. After DBSCAN has done so, it will start at a random point (in this case let’s assume it was one of the purple points), and it will count how many other points are nearby. DBSCAN will continue this process until no other data points are nearby, and then it will look to form a second cluster.</p>
<p>As you may have noticed from the graphic, there are a couple of parameters and specifications that we need to give DBSCAN before it does its work.</p>
<p>Referring back to the graphic, the <em>epsilon</em> is the radius given to test the distance between data points. If a point falls within the <em>epsilon</em> distance of another point, those two points will be in the same cluster.</p>
<p>Furthermore, the <em>minimum number of points needed</em> is set to 4 in this scenario. When going through each data point, as long as DBSCAN finds 4 points within <em>epsilon</em> distance of each other, a cluster is formed.</p>
<p>Naftali Harris has created a great web-based <a href="https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/">visualization</a> of running DBSCAN on a 2-dimensional dataset where you can set <em>epsilon</em> to higher and lower values. Try clicking on the “Smiley” dataset and hitting the GO button.</p>
<h2 id="dbscan-vs-k-means-clustering">DBSCAN vs K-Means Clustering</h2>
<p>DBSCAN is a popular clustering algorithm that is fundamentally very different from k-means.</p>
<ul>
<li>
<p>In k-means clustering, each cluster is represented by a centroid, and points are assigned to whichever centroid they are closest to. In DBSCAN, there are no centroids, and clusters are formed by linking nearby points to one another.</p>
</li>
<li>
<p>k-means requires specifying the number of clusters, ‘k’. DBSCAN does not but does require specifying two parameters which influence the decision of whether two nearby points should be linked into the same cluster. These two parameters are a distance threshold, ε (epsilon), and “MinPts” (minimum number of points), to be explained.</p>
</li>
<li>
<p>k-means runs over many iterations to converge on a good set of clusters, and cluster assignments can change on each iteration. DBSCAN makes only a single pass through the data, and once a point has been assigned to a particular cluster, it never changes.</p>
</li>
</ul>
<h2 id="my-approach-to-the-dbscan-algorithm">My Approach to the DBSCAN Algorithm</h2>
<p>I like the language of trees for describing cluster growth in DBSCAN. It starts with an arbitrary seed point which has at least MinPts points nearby within a distance or “radius” of ε. We do a breadth-first search along each of these nearby points. For a given nearby point, we check how many points <em>it</em> has within its radius. If it has fewer than MinPts neighbors, this point becomes a <em>leaf</em>–we don’t continue to grow the cluster from it. If it <em>does</em> have at least MinPts, however, then it’s a <em>branch</em>, and we add all of its neighbors to the FIFO (“First In, First Out”) queue of our breadth-first search.</p>
<p>Once the breadth-first search is complete, we are done with that cluster, and none of its points will be revisited. We pick a new arbitrary seed point (one that isn’t already part of another cluster) and grow the next cluster. This continues until all of the points have been assigned.</p>
<p>There is one other novel aspect of DBSCAN which affects the algorithm. If a point has fewer than MinPts neighbors, <strong><em>AND it’s not a leaf node of another cluster</em></strong>, then it’s labeled as a “Noise” point that doesn’t belong to any cluster.</p>
<p>Noise points are identified as part of the process of selecting a new seed: if a candidate seed point does not have enough neighbors, it is labeled as a Noise point. This label is often temporary, however; these Noise points are often picked up later by some cluster as a leaf node.</p>
<p>To understand this, I feel like it is best to just jump headfirst into the code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy

def dbscan(D, eps, MinPts):
    '''
    Cluster the dataset `D` using the DBSCAN algorithm.

    dbscan takes a dataset `D` (a list of vectors), a threshold distance
    `eps`, and a required number of points `MinPts`.

    It will return a list of cluster labels. The label -1 means noise, and
    the clusters are numbered starting from 1.
    '''
    # This list will hold the final cluster assignment for each point in D.
    # There are two reserved values:
    #   -1 - Indicates a noise point
    #    0 - Means the point hasn't been considered yet.
    # Initially all labels are 0.
    labels = [0] * len(D)

    # C is the ID of the current cluster.
    C = 0

    # This outer loop is just responsible for picking new seed points--a
    # point from which to grow a new cluster. Once a valid seed point is
    # found, a new cluster is created, and the cluster growth is all handled
    # by the 'grow_cluster' routine.
    # For each point P in the dataset D...
    # ('P' is the index of the datapoint, rather than the datapoint itself.)
    for P in range(0, len(D)):

        # Only points that have not already been claimed can be picked as
        # new seed points. If the point's label is not 0, continue to the
        # next point.
        if not (labels[P] == 0):
            continue

        # Find all of P's neighboring points.
        NeighborPts = region_query(D, P, eps)

        # If the number is below MinPts, this point is noise. This is the
        # only condition under which a point is labeled NOISE--when it's not
        # a valid seed point. A NOISE point may later be picked up by another
        # cluster as a boundary point (this is the only condition under which
        # a cluster label can change--from NOISE to something else).
        if len(NeighborPts) < MinPts:
            labels[P] = -1
        # Otherwise, if there are at least MinPts nearby, use this point as
        # the seed for a new cluster.
        else:
            C += 1
            grow_cluster(D, labels, P, NeighborPts, C, eps, MinPts)

    # All data has been clustered!
    return labels


def grow_cluster(D, labels, P, NeighborPts, C, eps, MinPts):
    '''
    Grow a new cluster with label `C` from the seed point `P`.

    This function searches through the dataset to find all points that
    belong to this new cluster. When this function returns, cluster `C`
    is complete.

    Parameters:
      `D`           - The dataset (a list of vectors)
      `labels`      - List storing the cluster labels for all dataset points
      `P`           - Index of the seed point for this new cluster
      `NeighborPts` - All of the neighbors of `P`
      `C`           - The label for this new cluster
      `eps`         - Threshold distance
      `MinPts`      - Minimum required number of neighbors
    '''
    # Assign the cluster label to the seed point.
    labels[P] = C

    # Look at each neighbor of P (neighbors are referred to as Pn).
    # NeighborPts will be used as a FIFO queue of points to search--that is,
    # it will grow as we discover new branch points for the cluster. The FIFO
    # behavior is accomplished by using a while-loop rather than a for-loop.
    # In NeighborPts, the points are represented by their index in the
    # original dataset.
    i = 0
    while i < len(NeighborPts):

        # Get the next point from the queue.
        Pn = NeighborPts[i]

        # If Pn was labelled NOISE during the seed search, then we know it's
        # not a branch point (it doesn't have enough neighbors), so make it
        # a leaf point of cluster C and move on.
        if labels[Pn] == -1:
            labels[Pn] = C
        # Otherwise, if Pn isn't already claimed, claim it as part of C.
        elif labels[Pn] == 0:
            # Add Pn to cluster C (assign cluster label C).
            labels[Pn] = C

            # Find all the neighbors of Pn.
            PnNeighborPts = region_query(D, Pn, eps)

            # If Pn has at least MinPts neighbors, it's a branch point!
            # Add all of its neighbors to the FIFO queue to be searched.
            if len(PnNeighborPts) >= MinPts:
                NeighborPts = NeighborPts + PnNeighborPts
            # If Pn *doesn't* have enough neighbors, then it's a leaf point,
            # and we don't queue up its neighbors as expansion points.

        # Advance to the next point in the FIFO queue.
        i += 1

    # We've finished growing cluster C!


def region_query(D, P, eps):
    '''
    Find all points in dataset `D` within distance `eps` of point `P`.

    This function calculates the distance between point P and every other
    point in the dataset, and then returns only those points which are
    within the threshold distance `eps`.
    '''
    neighbors = []

    # For each point in the dataset...
    for Pn in range(0, len(D)):
        # If the distance is below the threshold, add it to the neighbors
        # list.
        if numpy.linalg.norm(D[P] - D[Pn]) < eps:
            neighbors.append(Pn)

    return neighbors
</code></pre></div></div>
<p>Above is a working implementation in Python. Please note that the emphasis here is on illustrating the algorithm; the distance calculations, for example, could be optimized significantly.</p>
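<p>A minimal end-to-end run might look like the following; the three functions are condensed here (without the explanatory comments) so the snippet stands alone:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy

def region_query(D, P, eps):
    # Indices of all points strictly within eps of point P (including P).
    return [Pn for Pn in range(len(D)) if numpy.linalg.norm(D[P] - D[Pn]) < eps]

def grow_cluster(D, labels, P, NeighborPts, C, eps, MinPts):
    labels[P] = C
    i = 0
    while i < len(NeighborPts):          # NeighborPts acts as the FIFO queue
        Pn = NeighborPts[i]
        if labels[Pn] == -1:             # former noise becomes a leaf
            labels[Pn] = C
        elif labels[Pn] == 0:            # unclaimed: claim it, maybe branch
            labels[Pn] = C
            PnNeighborPts = region_query(D, Pn, eps)
            if len(PnNeighborPts) >= MinPts:
                NeighborPts = NeighborPts + PnNeighborPts
        i += 1

def dbscan(D, eps, MinPts):
    labels = [0] * len(D)
    C = 0
    for P in range(len(D)):
        if labels[P] != 0:
            continue
        NeighborPts = region_query(D, P, eps)
        if len(NeighborPts) < MinPts:
            labels[P] = -1               # noise (may be revised later)
        else:
            C += 1
            grow_cluster(D, labels, P, NeighborPts, C, eps, MinPts)
    return labels

# Two well-separated blobs plus one outlier.
D = [numpy.array(p) for p in
     [(0, 0), (0, 0.4), (0.4, 0), (0.4, 0.4),
      (5, 5), (5, 5.4), (5.4, 5), (5.4, 5.4),
      (10, 10)]]
print(dbscan(D, eps=1.0, MinPts=4))
# [1, 1, 1, 1, 2, 2, 2, 2, -1]
</code></pre></div></div>

<p>Each blob’s points all lie within ε of one another, so each blob becomes its own cluster (labels 1 and 2), while the isolated point is labeled -1 (Noise).</p>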
<p>You can also find this code, along with a validation Python file, on GitHub <a href="https://github.com/scrunts23/CS-Data-Science-Build-Week-1/tree/master/model">here</a>.</p><h2 id="millennials-more-narcissistic-than-other-generations">Millennials: More Narcissistic than Other Generations?</h2>
<p><strong>Brief History Lesson in Greek Mythology:</strong></p>
<p>Narcissism as a personality trait is generally conceived as excessive self-love. In Greek mythology, Narcissus was a man who fell in love with his own reflection in a pool of water and disdained those who loved him, even causing some of them to take their own lives to prove their devotion to his striking beauty. His personality traits give us the modern-day term narcissism.</p>
<p><strong>Introduction to the NPI Survey:</strong></p>
<p><img src="https://cdn-images-1.medium.com/max/5458/1*XLy6FzJW1GfSboJujn-0Ag.jpeg" alt="" /></p>
<p>The NPI (Narcissistic Personality Inventory) was developed by Raskin and Hall (1979) for the measurement of narcissism as a personality trait within social psychological research. It is based on the definition of narcissistic personality disorder found in the DSM-III (<em>Diagnostic and Statistical Manual of Mental Disorders, Vol. 3</em>), but it is not a diagnostic tool for Narcissistic Personality Disorder (NPD); instead, it measures sub-clinical or social expressions of narcissism. So even someone who gets the highest possible score on the NPI does not necessarily have NPD. The NPI consists of forty pairs of statements; for each pair, respondents select the statement that best reflects their personality.</p>
<p><strong>Data Processing:</strong></p>
<p>Upon initial investigation of the NPI assessment results, the data frame included 11,243 entries:</p>
<ul>
<li>
<p>Gender data for everyone that completed the NPI assessment</p>
</li>
<li>
<p>Age of individuals that completed the NPI assessment (ages below 15 were omitted in the data collection process)</p>
</li>
<li>
<p>Score for each paired question (n = 40)</p>
</li>
<li>
<p>The total score for the NPI assessment (range of 0–40)</p>
</li>
<li>
<p>Time elapsed for taking the NPI assessment</p>
</li>
</ul>
<p>The next step was to filter the data for any outliers in the age distribution. This step showed it was necessary to omit a portion of the NPI assessment results to keep the age distribution between 15 and 100 years old.</p>
<p><img src="https://cdn-images-1.medium.com/max/2000/1*ayerkZKzAWePqFEBpRwUuQ.png" alt="Identifying age outliers within the data" /></p>
<p>Once the data was processed, the sample consisted of 11,009 individuals within the age range of 15–100 years old. The next step was to create generation data based on the ages of the individuals that completed the NPI assessment. The breakdown of generations is below:</p>
<ul>
<li>
<p>Generation Z (15–25 years old)</p>
</li>
<li>
<p>Millennials (26–40 years old)</p>
</li>
<li>
<p>Generation X (41–55 years old)</p>
</li>
<li>
<p>Baby Boomer Generation (56–74 years old)</p>
</li>
<li>
<p>The Silent Generation (75–100 years old)</p>
</li>
</ul>
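<p>The original processing code is not shown here, but the age-to-generation mapping described above might look something like this sketch (the function name is illustrative):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def generation(age):
    '''Map an age (already filtered to 15-100) to its generation label.'''
    if age <= 25:
        return 'Generation Z'
    if age <= 40:
        return 'Millennials'
    if age <= 55:
        return 'Generation X'
    if age <= 74:
        return 'Baby Boomer Generation'
    return 'The Silent Generation'

print(generation(30))
# Millennials
</code></pre></div></div>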
<p><strong>Initial Data Exploration:</strong></p>
<p>Using the processed data, histograms were created to show the overall distribution of scores with the number of NPI assessments taken. As seen below, the average score is approximately 13.2.</p>
<p><img src="https://cdn-images-1.medium.com/max/2000/1*02j961qstF3h8zZBe9ygcA.png" alt="" /></p>
<p>Since the original data set includes gender categories, the next step is to break down the distribution of scores by gender (male, female, and other genders):</p>
<p><img src="https://cdn-images-1.medium.com/max/2000/1*cnRXrJa-ujsc2QoVFfd0Jg.png" alt="" /></p>
<p><img src="https://cdn-images-1.medium.com/max/2000/1*zuetiM3dTkhQ2ola4axvRQ.png" alt="" /></p>
<p><img src="https://cdn-images-1.medium.com/max/2000/1*cB-NDJZL-po7HNftzBDTrQ.png" alt="Distribution of Scores based on Gender" /></p>
<p>As expected, the average for each gender roughly mimics the overall distribution of scores. The male category shows an average score of 14.1 with a sample size of 6,310. The other-genders category also has an average score of 14.1, but with a much smaller sample size of 34. The female category has an average score of 11.9 with a sample size of 4,665. The distribution of scores by gender shows that women tend to have lower narcissistic tendencies compared with the other two gender categories.</p>
<p><strong>Which Generation is More Narcissistic?</strong></p>
<p>Using the generation data created in the data processing step, the next step is to create a density plot showing the overall distribution of scores by generation, to answer the question of which generation shows higher narcissistic traits based on the NPI assessment.</p>
<p><img src="https://cdn-images-1.medium.com/max/2000/1*lWTiV-3zWx3T_DtDACWadQ.png" alt="Density plot for scores based on Generations" /></p>
<p>Based on the graph above, the Silent, Baby Boomer, and Gen X generations share similar trends, with average scores below 10, while Gen Z and Millennials share similar trends, with higher average scores closer to the whole sample’s mean. Based on the overall distribution, Generation Z on average has more individuals with higher narcissistic tendencies. The next step is to look at the breakdown of gender and generation together to see whether it shows a similar trend to the overall generation density plot.</p>
<p><img src="https://cdn-images-1.medium.com/max/2000/1*E8tAC7CYBRDSjj3UkQn3NA.png" alt="" /></p>
<p><img src="https://cdn-images-1.medium.com/max/2000/1*cFL5NNyB2OIzSiIush5lUg.png" alt="" /></p>
<p><img src="https://cdn-images-1.medium.com/max/2000/1*ZPn9Mndin7mzH9YIVaEbhw.png" alt="Density plot of Scores based on Generations for Males, Females and Other Genders" /></p>
<p>Similar to the histograms above, one can interpret that the males within this data set show higher narcissistic tendencies based on their NPI assessment answers. Each plot, especially the male and female plots, shows that Generation Z includes a higher number of people with high NPI assessment scores. The density plot for the other-genders category is skewed due to the low number of individuals who identify as other genders.</p>
<p><strong>Deeper Dive into the Generation Data:</strong></p>
<p>From the analysis of the overall distribution of scores and the generation distributions by gender, the next step is to take a deeper look at the count of individuals within each score range.</p>
<p><img src="https://cdn-images-1.medium.com/max/2000/1*AF2VgzZJOn6iqxfrrSXAfQ.png" alt="Heat map of Scores based on Generations" /></p>
<p>Comparing the overall non-gender-specific density plots and the heat map, we can confirm that Gen Z individuals, as a whole, have a higher number of participants with high assessment scores.</p>
<p><img src="https://cdn-images-1.medium.com/max/2000/1*BKjTgAMBAEesiub6KSLdKA.png" alt="Heat map of Scores based on Generations for Males" /></p>
<p>Interpreting the male generation distribution, male individuals within Generation Z have a higher average NPI assessment score.</p>
<p><img src="https://cdn-images-1.medium.com/max/2000/1*N8qhgTd9wJpH8b8Qlxypug.png" alt="Heat map of Scores based on Generations for Females" /></p>
<p>Looking at the breakdown of the females within the dataset, we can confirm that most of the females fall near the calculated average of 11.9. Similar to the males, the females also share the trend that Generation Z has, on average, more individuals with higher narcissistic tendencies.</p>
<p><img src="https://cdn-images-1.medium.com/max/2000/1*0hwuYxtl2ikoB7rDN3OfVg.png" alt="Heat map of Scores based on Generations for Other Genders" /></p>
<p>For the gender category labeled other, the distribution confirms that the majority of individuals fall near the calculated average score of 14.1. With this gender group, it is hard to see a pattern similar to the other gender groups due to the low number of individuals in the category.</p>
<p><strong>Conclusions: Who is more Narcissistic?</strong></p>
<p>Based on this sample of those who completed the NPI assessment, Generation Z and Millennials show higher narcissistic tendencies based on their NPI scores. From the heat-map analysis across all studied gender groups, individuals within Generation Z show the most narcissistic tendencies, especially males within that generation.</p>
<p><img src="https://cdn-images-1.medium.com/max/2000/1*Rz8ym3nJ1YPjSee2Rdr5gQ.jpeg" alt="" /></p>
<p><strong>What is Next:</strong></p>
<p>This dataset contains many possibilities that could lead to various data exploration and modeling. Some items that could be looked into in the future include:</p>
<ul>
<li>
<p>Does the time taken affect the overall assessment score?</p>
</li>
<li>
<p>Can scores, age, or gender be predicted from the time elapsed on the NPI assessment?</p>
</li>
</ul>
<p>Link to GitHub Repository <a href="https://github.com/scrunts23/Unit-1-build-">here</a>.</p>
<p>Note this is a project for Lambda School within the Data Science track.</p>
<p>Sources:</p>
<p>Raskin, R.; Terry, H. (1988). “A principal-components analysis of the Narcissistic Personality Inventory and further evidence of its construct validity”. <em>Journal of Personality and Social Psychology</em>, 54(5), 890–902.</p>