Everyone can do Data Science, Part 2 — Pandas
The following is a guest post by Fabien Durand in the Everyone can do Data Science series. Here, Fabien shows us the second step of his tutorial: how to prepare the data extracted with Import.IO in step 1 before using BigML.
The first step dealt with extracting real estate data from realtor.com using Import.IO. The CSV we generated in the end can be downloaded here: realtor_importio_raw.csv. You can also see how it appears in Google Spreadsheet:
The only problem with the CSV is that it contains errors we need to get rid of. By errors I mean useless columns, inconsistent units of measurement in the "lot size" column, listings that are bare land without a building, and duplicates. This procedure is called cleaning or preparing the data. I will show you how to remove the errors and prepare the data with the help of Pandas.
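To give a rough idea of what handling these four problems can look like in Pandas, here is a minimal sketch on a tiny made-up table. The column names (address, price, lot_size, building_sqft, page_url), the sample values and the lot_size_to_sqft helper are all assumptions for illustration; the real CSV and the notebook may use different names and formats.

```python
import pandas as pd

SQFT_PER_ACRE = 43560  # 1 acre = 43,560 square feet


def lot_size_to_sqft(value):
    """Convert strings such as '0.25 Acres' or '8,712 Sq Ft' to square feet.

    Hypothetical helper: assumes the number comes first, followed by the unit.
    """
    if pd.isna(value):
        return None
    text = str(value).lower().replace(",", "").strip()
    number = float(text.split()[0])
    if "acre" in text:
        return number * SQFT_PER_ACRE
    return number


# Made-up sample rows standing in for the raw Import.IO export.
raw = pd.DataFrame({
    "address": ["1 Main St", "2 Oak Ave", "2 Oak Ave", "3 Pine Rd"],
    "price": [250000, 310000, 310000, 90000],
    "lot_size": ["0.25 Acres", "8,712 Sq Ft", "8,712 Sq Ft", "2 Acres"],
    "building_sqft": [1800.0, 2100.0, 2100.0, None],  # None = land only
    "page_url": ["u1", "u2", "u2", "u3"],             # useless for modelling
})

cleaned = (
    raw.drop(columns=["page_url"])                    # 1. useless columns
       .assign(lot_size_sqft=lambda df: df["lot_size"].map(lot_size_to_sqft))
       .drop(columns=["lot_size"])                    # 2. one unit for lot size
       .dropna(subset=["building_sqft"])              # 3. land without a building
       .drop_duplicates()                             # 4. duplicate listings
       .reset_index(drop=True)
)

print(cleaned)
```

Chaining the steps keeps the raw table untouched, which makes it easy to rerun the cleaning after adjusting one step.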
For those who are unfamiliar with Pandas, it is a very popular Python library designed to facilitate data wrangling tasks. Check out the "10 minutes to Pandas" tutorial for a quick introduction.
To see Pandas' capabilities in action, I recommend having a look at this tutorial by Alexandre Vallette. In it, Alexandre uses open data to predict election abstention rates, and the data-cleaning part is carried out with Pandas.
Wakari to code in the cloud
IPython Notebook is a very nice tool to edit, run and comment your code directly from your web browser. Wakari is a service that can host your notebooks and run them in the cloud. By hosting on Wakari you avoid all the hassle of installing and configuring IPython on your local computer.
I made an IPython Notebook for our example. It is publicly available (see below), so you only need to click on "Run/Edit this notebook" to copy it to your Wakari account (you can create one for free). Once it is in your Wakari account, you will be able to upload your CSV to the same directory as the IPython Notebook (the file manager is the left panel).
Cleaning the Data
The notebook I have made shows step by step how to clean the data we got from Import.IO. It will only work for the particular data structure that we have here but you can easily adapt it to other needs and other datasets.
You can see the embedded IPython notebook below, but I strongly recommend going to the original page (click here), as it is much nicer to work directly on Wakari.
Once you have completed all the instructions in the IPython Notebook, you will be able to save the resulting data as a CSV file (the last instruction creates the file). You can download mine here: realtor_importio_cleaned.csv. This is what it looks like in Google Spreadsheet:
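For readers who have not written a CSV from Pandas before, the saving step usually comes down to a single to_csv call. The sketch below is not the notebook's actual last instruction; the sample table and the output filename realtor_cleaned_example.csv are made up, and the file is written to the system's temporary directory so nothing in your working folder is overwritten.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the cleaned table produced by the notebook.
cleaned = pd.DataFrame({
    "address": ["1 Main St", "2 Oak Ave"],
    "price": [250000, 310000],
    "lot_size_sqft": [10890.0, 8712.0],
})

# Hypothetical output path inside the temporary directory.
out_path = Path(tempfile.gettempdir()) / "realtor_cleaned_example.csv"

# index=False leaves out the row index, so the file holds only the data columns.
cleaned.to_csv(out_path, index=False)

# Read the file back as a quick check that it was written as expected.
reloaded = pd.read_csv(out_path)
print(reloaded)
```

Leaving out the index is a sensible default here, because an extra unnamed column would otherwise show up as a meaningless field when the file is imported elsewhere.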
Next step: predictive modelling
Now that the data has been cleaned, it is ready to be imported into BigML. We will see in the next article how to use this data to automatically create a model that predicts the value of real estate properties.
Stay tuned!
Other posts in the series
"Everyone can do Data Science"
Fabien is a business student at the EBS University for Business and Law in Germany and at the KEDGE Business School in France. He loves to bring the IT and Business world together. At the moment, everything from Data Science to Digital Innovation brings him much joy! You can find him on LinkedIn and Twitter.