How to Clean Data

TRANSCRIPT
Hello everybody! In this video we're gonna get started with data analysis by first discussing data cleaning, which is an integral part of any kind of data analysis. Basically, data cleaning is all about making sure that your data is ready to go for any other kind of analysis. So this might involve something like looking at columns of our data in a data frame and saying, "Well, these column values aren't quite right." It might involve something like removing or renaming columns in a data frame, or it might deal with how we handle null values, not-a-number values, or missing data in a data frame.
So we're gonna be discussing all of those topics, but the first thing that we have to do is download the source code and unzip it somewhere. When you download the source code off the website and unzip it, there'll be some CSV files. The main CSV file is flights.csv, and if you need a refresher on what kind of data is in there, you can look at the readme. If you open up the readme, you get some information as to what all the columns are and what they mean, as well as the values and formats that they might take on, and their units as well. And then, if you're unfamiliar with some flight terms, you can always look through terms.csv, and then there's some other information, such as codes for the weekdays, airport IDs, and airline IDs. All this information is critical, so I suggest you take a quick look through some of this data, especially the readme and the terms file, so that you're familiar with the columns.
Okay, so after we have done that, and after you download the environment file and load that up, we can get started with data cleaning. We want to make sure that we have the right environment, and then let's launch Spyder. I already have an instance of Spyder running here. So I have a file opened up in Spyder, and I'm just gonna save this guy to the same data analysis folder that has all of my CSV files. I'm just gonna call this data_cleaning.py. Excellent, so the first thing that we'll need to do is import pandas, so we'll do import pandas as pd, and the other thing we want to do is import numpy as well, because we'll be using it for our data cleaning. Then we have to load our flights data set. So flights equals pd.read_csv, flights.csv, and then we have to make sure that's indexed correctly, so we'll set that to false, and we can always check just by saying something like print flights. So this is just some code that will load in our data and then print out some flight information, so let's run this guy. After a couple of seconds it'll load up our data, so we have a lot of data here, and great! There's some information here about year, month, air time, and distance. The output seems to be wrapped over a little bit, sorry about that.
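The loading step described above might look roughly like the following sketch. It is not the course's actual source file: the tiny in-memory CSV stands in for flights.csv so the example runs on its own, the uppercase column names and sample values are assumptions, and `index_col=False` is a guess at what "set that to false" refers to.

```python
import io

import pandas as pd

# Stand-in for flights.csv so this sketch is self-contained.
# Column names and values are assumptions, not the real data set.
sample_csv = io.StringIO(
    "YEAR,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,AIR_TIME,DISTANCE\n"
    "2015,1,1,4,2015-01-01,338.0,2475.0\n"
    "2015,1,2,5,2015-01-02,,2475.0\n"
)

# With the real file this would be pd.read_csv("flights.csv", index_col=False).
# index_col=False tells pandas not to use the first column as the row index.
flights = pd.read_csv(sample_csv, index_col=False)

# Quick check that the data loaded as expected.
print(flights)
```

The empty AIR_TIME field in the second sample row is read as a missing value (NaN), which is the kind of thing data cleaning has to deal with.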
Alright, so we have information and all the numbers seem to check out. Now you notice that it took some time for us to load in that information, and actually, how this IPython console works is that we only have to load this in once. From now on, what I'll do is comment this guy out, with a note to remind you that this loads our flights data set. I'm gonna comment this guy out because anytime we run this file, all of the variables, or things that we defined, are persistent in IPython. So in here I can just type in flights.dtypes and view all of the different data types for our columns, for example. And then I can do the same thing again because the flights variable has already been established. So it does prevent us from having to run this file and then keep waiting over and over again for the data set to reload; we can just load it once.
Alright, so it's interesting that I printed this out because this is what we're gonna be looking at. This flights.dtypes just prints out all of the data types of our columns so we can take a look. So year, month, day_of_month, and day_of_week are all int64, which makes sense because we expect them to be integers. FL_date is flight date, and if you notice, this is actually an object, and an object here is a string, but this isn't quite the appropriate data type for a date, right? You expect a date to be a date/time object. In Python we have these things called date/time objects, and in pandas there's a date/time object as well. This isn't quite right: if we're trying to do any kind of time series analysis on this, it's a bit tricky to do it on strings, but if we had it as an actual date/time then we could do all kinds of things, like add or subtract different kinds of date/times, and everything would just work together because of pandas.
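To make the point about strings versus date/times concrete, here is a minimal sketch using a small hand-made data frame. The column names and dates are assumptions, and `pd.to_datetime` is just one common way to do the conversion; the video may go about it differently.

```python
import pandas as pd

# Tiny stand-in for the flights data; names and values are assumptions.
flights = pd.DataFrame({
    "YEAR": [2015, 2015],
    "FL_DATE": ["2015-01-01", "2015-01-03"],
})

# Inspect the column types. The date column is stored as text here
# (reported as object in many pandas versions), not as a date/time.
print(flights.dtypes)

# Convert the text column to a proper date/time column.
fl_date = pd.to_datetime(flights["FL_DATE"])
print(fl_date.dtype)

# Date arithmetic now works: subtracting two dates gives a Timedelta,
# and adding a Timedelta shifts every date in the column.
gap = fl_date.iloc[1] - fl_date.iloc[0]
next_day = fl_date + pd.Timedelta(days=1)
print(gap)
print(next_day)
```

Trying the same subtraction on the original text column would raise an error, which is why the string type is awkward for time series work.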
