Lambda School Data Science - A First Look at Data
Lecture - let's explore Python DS libraries and examples!
The Python Data Science ecosystem is huge. You've seen some of the big pieces - pandas, scikit-learn, matplotlib. What parts do you want to see more of?
import numpy as np
np.random.randint(0, 10, size=10)
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [2, 4, 6, 10]
print(x, y)
plt.scatter(x, y, color='k')
plt.plot(x, y, color='r')
import pandas as pd
df = pd.DataFrame({'first_col': x, "second_col": y})
df
df['first_col']
Assignment - now it's your turn
Pick at least one Python DS library, and using documentation/examples reproduce in this notebook something cool. It's OK if you don't fully understand it or get it 100% working, but do put in effort and look things up.
# create two NumPy arrays of 25 random integers each, drawn from 0 to 99 (the upper bound is exclusive)
x_mine = np.random.randint(0, 100, size=25)
y_mine = np.random.randint(0, 100, size=25)
# put those integers into a DataFrame
df = pd.DataFrame({'column_1': x_mine, 'column_2': y_mine})
# compute the absolute difference between column_1 and column_2 and store it in a new column
df['column_3'] = (df['column_1'] - df['column_2']).abs()
# plot column_1 against column_2 as a scatter plot, with some bells & whistles:
# marker size comes from column_3, and marker colors are random
area = df['column_3']**2 / 2
colors = np.random.rand(25)
plt.scatter(x_mine, y_mine, s=area, c=colors)
plt.show()
Assignment questions
After you've worked on some code, answer the following questions in this text block:
Describe in a paragraph of text what you did and why, as if you were writing an email to somebody interested but nontechnical.
What was the most challenging part of what you did?
What was the most interesting thing you learned?
What area would you like to explore with more time?
Assignment Answers
1) The task at hand was to demonstrate a basic understanding of the iconic trio of Data Science Python libraries. I first used NumPy, a library for numerical computing, to build two sets of 25 random integers from 0 to 99. I then used pandas to organize these datasets into what's called a DataFrame, which is pandas-speak for a table like the ones you might see in a spreadsheet program such as Excel or Google Sheets. Using the pandas absolute value method, I calculated the distance between the paired values in the two datasets and assigned it to a third column. Finally, I used Matplotlib to visualize all of the data. The X and Y axes show the original random values, while the size of each point is half the square of the absolute difference stored in the third column. The point colors are random.
2) The most challenging part of this assignment was trying to find a way to incorporate a third dataset into a 2D scatterplot. Matplotlib's website provides many examples, and I found what I was looking for there.
3) The most interesting thing I learned was how many additional parameters the scatter plot accepts. There are many more than I used, but using any more seemed like overkill for this assignment.
4) I'd like to see more of the colormap parameter (cmap) for scatter plots. It seems like it could be a distinctive way to visualize certain sets of data.
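As a follow-up to answer 4, here is a minimal sketch of what exploring the colormap parameter could look like. It is not part of the original assignment: the names rng, df_demo, and points, the seed value, and the choice of the 'viridis' colormap are all illustrative assumptions. Instead of random colors, it lets color encode the absolute difference in the third column and adds a colorbar as a legend.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Seeded generator so the example is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(42)

# Same shape of data as the assignment: two columns of 25 random integers in 0..99.
df_demo = pd.DataFrame({
    'column_1': rng.integers(0, 100, size=25),
    'column_2': rng.integers(0, 100, size=25),
})
df_demo['column_3'] = (df_demo['column_1'] - df_demo['column_2']).abs()

fig, ax = plt.subplots()
points = ax.scatter(
    df_demo['column_1'],
    df_demo['column_2'],
    c=df_demo['column_3'],  # color encodes the absolute difference
    cmap='viridis',         # illustrative choice; other colormap names work too
    s=60,                   # fixed marker size so color is the only third variable
    alpha=0.8,              # slight transparency helps when points overlap
)

# The colorbar maps colors back to column_3 values.
fig.colorbar(points, ax=ax, label='column_3 (absolute difference)')
ax.set_xlabel('column_1')
ax.set_ylabel('column_2')
```

In a notebook the figure should render when the cell finishes; in a script, a call to plt.show() would be needed. Keeping the marker size fixed here is a design choice: encoding the same column in both size and color is possible, but it makes it harder to tell what each visual channel means.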
Stretch goals and resources
The following are optional things for you to take a look at. Focus on the above assignment first, and make sure to commit and push your changes to GitHub (and since this is the first assignment of the sprint, open a PR as well).
- pandas documentation
- scikit-learn documentation
- matplotlib documentation
- Awesome Data Science - a list of many types of DS resources
Stretch goals:
- Find and read blogs, walkthroughs, and other examples of people working through cool things with data science - and share with your classmates!
- Write a blog post (Medium is a popular place to publish) introducing yourself as somebody learning data science, and talking about what you've learned already and what you're excited to learn more about.
Stretch Goals Medium Post: https://medium.com/@stephenplainte/my-first-day-as-a-data-science-student-at-lambda-school-ea29d2889452