How to Quickly Extract Data from Big Datasets in 3 Steps
So I wanted to do a data science project based on a similar one I found, but with a more recent dataset, rather than some random out-of-date one pulled from an API that hasn't been updated in a while.
I wanted to do the data extraction and the cleanup on my own, because I like to make things challenging.
Plus, it would give me a taste of what happens in the real world, where you never get bloated files with missing values. Just kidding.
I found such a dataset from SimFin, which contains stock data for all US-traded companies from roughly the past 5 years. It is a massive CSV file with over 5.3 million rows that sits at 354 MB!
In its current state, I can't do any analysis on it without it blowing up my laptop, and I can't upload it anywhere (e.g., Colab, GitHub) either. The goal is to extract a year's worth of data on all stocks traded in the US.
So I have my work cut out for me.
Step 1: Load the dataset using nrows
I did a bulk download of the CSV from SimFin and loaded it into my Jupyter notebook, just to see how big it was and how long it would take to load. If you don't specify the number of rows to load, pandas will load them ALL and eat up your system's memory; in this case, it used 1013.4 MB and it was slow!
%%time
# Pull in all the data and time it to see how long it takes,
# then check how many rows came in
import pandas as pd
df = pd.read_csv('us-shareprices-daily.csv', delimiter=';')
len(df)
CPU times: user 1.95 s, sys: 657 ms, total: 2.61 s
Wall time: 2.85 s
5322568
And it took almost 3 seconds just to load all 5.3 million rows!
Now, let's look at the dataframe's info.
df.info(verbose=True, memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5322568 entries, 0 to 5322567
Data columns (total 11 columns):
# Column Dtype
--- ------ -----
0 Ticker object
1 SimFinId int64
2 Date object
3 Open float64
4 High float64
5 Low float64
6 Close float64
7 Adj. Close float64
8 Volume int64
9 Dividend float64
10 Shares Outstanding float64
dtypes: float64(7), int64(2), object(2)
memory usage: 1013.4 MB
This is using a ton of memory, so let's see what we can trim.
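To see exactly where that memory is going, it helps to look at per-column usage first. Here's a small sketch (my own addition, not part of the original notebook) using pandas' built-in memory_usage method:
# Per-column memory usage in MB; deep=True counts the actual Python strings
# stored in the object columns (Ticker, Date)
(df.memory_usage(deep=True) / 1024**2).round(1).sort_values(ascending=False)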
Let's see which columns have missing data in the original dataset, and whether I even need them, by running df.isnull().sum():
print(df.isnull().sum())
Ticker 0
SimFinId 0
Date 0
Open 0
High 0
Low 0
Close 0
Adj. Close 0
Volume 0
Dividend 5288387
Shares Outstanding 380181
dtype: int64
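Those raw counts are easier to judge as percentages, so here's a quick sketch (again my own addition, not a step from the original cleanup):
# Missing values as a percentage of all rows
print((df.isnull().sum() / len(df) * 100).round(1))
Dividend comes out at roughly 99% missing, which makes the call on that column easy.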
Now that I have a clearer picture of what is in the dataset, I can work on cleaning it.
Using nrows
If you use the nrows option in the read_csv function, you can dramatically decrease memory usage by doing your cleanup on a much smaller slice of the data. In this case, I only want to bring in 1,000 rows so I can do some cleanup and get a general feel for which datatypes I need to convert to.
# Pull in only 1000 records
df = pd.read_csv('us-shareprices-daily.csv', delimiter=';', nrows=1000)
# Get detailed information about the DataFrame df
df.info(verbose=False, memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 11 entries, Ticker to Shares Outstanding
dtypes: float64(6), int64(3), object(2)
memory usage: 192.5 KB
In this case, bringing in only 1,000 records to work with has dramatically decreased the memory usage from 1013.4 MB down to 192.5 KB!
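One caveat: nrows just takes the first 1,000 rows of the file, which may all come from the same handful of companies if the file is grouped by ticker. If you want a more representative sample for a similar memory cost, read_csv also accepts a callable for skiprows. A rough sketch (df_sample and the 1% sampling rate are just illustrative, not part of my actual cleanup):
import random

# Keep the header (row 0) and roughly 1% of the remaining rows at random
df_sample = pd.read_csv('us-shareprices-daily.csv', delimiter=';',
                        skiprows=lambda i: i > 0 and random.random() > 0.01)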
Step 2: Clean up columns
Edit or remove columns
Extract only the columns you need from the dataframe and figure out which datatypes are best for the data you are bringing in. Let's look at what columns are in my dataset.
# extract column names
cols = df.columns.values
cols
Here's the column list for my dataset.
array(['Ticker', 'SimFinId', 'Date', 'Open', 'High', 'Low', 'Close',
'Adj. Close', 'Volume', 'Dividend', 'Shares Outstanding'],
dtype=object)
In my case, I don't need the Shares Outstanding, Adj. Close, or Dividend columns for my analysis, plus there's a lot of missing data in two of them, so that's three birds with one stone.
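If you already have a dataframe loaded, the same trim could also be done after the fact with drop. A minimal sketch using my column names (shown for completeness, not a step from my actual cleanup):
# Drop the columns I don't need from an already-loaded dataframe
df = df.drop(columns=['Adj. Close', 'Dividend', 'Shares Outstanding'])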
Now, I will define the list of columns that I do need and store them in the variable required_cols.
required_cols = ['Ticker', 'SimFinId', 'Date', 'Open', 'High', 'Low', 'Close', 'Volume']
I will load only these columns by moving the required_cols variable above the read_csv call and adding the option usecols=required_cols to it.
import pandas as pd
required_cols = ['Ticker', 'SimFinId', 'Date', 'Open', 'High', 'Low', 'Close', 'Volume']
df = pd.read_csv('us-shareprices-daily.csv', delimiter=';', nrows=1000, usecols=required_cols)
Load the dataset again with only the required columns and then rerun df.info(verbose=False, memory_usage='deep'). I now only have the columns I specified above.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Ticker 1000 non-null category
1 SimFinId 1000 non-null int64
2 Date 1000 non-null object
3 Open 1000 non-null float32
4 High 1000 non-null float32
5 Low 1000 non-null float32
6 Close 1000 non-null float32
7 Volume 1000 non-null int64
dtypes: category(1), float32(4), int64(2), object(1)
memory usage: 97.8 KB
Now my memory usage is down to 97.8 KB from 192.5 KB!
Convert the column datatypes
Different data types require varying amounts of memory, so by converting columns to more memory-efficient data types, I can significantly reduce the memory footprint of my dataset when I do any analysis.
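To make that concrete: a float64 takes 8 bytes per value and a float32 takes 4, so halving the width roughly halves that column's footprint. A small sketch of that check (my own addition, using the Open column as an example):
import numpy as np

# 8 bytes vs. 4 bytes per value
print(np.dtype('float64').itemsize, np.dtype('float32').itemsize)

# One column's memory before and after downcasting
print(df['Open'].memory_usage(deep=True))
print(df['Open'].astype('float32').memory_usage(deep=True))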
Run describe() on your data to see the statistics; it gives you a better idea of which data types would actually work best and shows you what breaks when you convert from one datatype to another.
If you use .info() on your dataset, you can get column and datatype information:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Ticker 1000 non-null object
1 SimFinId 1000 non-null int64
2 Date 1000 non-null object
3 Open 1000 non-null float64
4 High 1000 non-null float64
5 Low 1000 non-null float64
6 Close 1000 non-null float64
7 Volume 1000 non-null int64
dtypes: float64(4), int64(2), object(2)
memory usage: 62.6+ KB
Let's have my dataset use memory-efficient data types. Instead of float64 and int64, I will use float32 and int32, so that I don't lose any values in the conversion.
Depending on the data type, the conversion loop below performs one of the following conversions:
- If the column has a data type of float64, it converts the column to the more memory-efficient float32 (or float16).
- If the column has a data type of int64, it converts the column to the more memory-efficient int32 (or int16).
- If the column has a data type of object, it converts the column to category. This can be useful for categorical data with a limited number of unique values and can save memory.
So in my case, I decided that Ticker should be a category instead of an object, that my float values should be float32, and that all the integers should be int32. You will have to tweak these datatypes for your own needs. Here's my final code and output; if I wanted to convert an existing dataframe, I would run this:
for col in df.columns:
    if df[col].dtype == 'float64':
        df[col] = df[col].astype('float32')
    if df[col].dtype == 'int64':
        df[col] = df[col].astype('int32')
    if df[col].dtype == 'object' and df[col].name == 'Ticker':
        df[col] = df[col].astype('category')
    if df[col].name == 'Date':
        df[col] = df[col].astype('datetime64[ns]')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Ticker 1000 non-null category
1 SimFinId 1000 non-null int32
2 Date 1000 non-null datetime64[ns]
3 Open 1000 non-null float32
4 High 1000 non-null float32
5 Low 1000 non-null float32
6 Close 1000 non-null float32
7 Volume 1000 non-null int32
dtypes: category(1), datetime64[ns](1), float32(4), int32(2)
memory usage: 32.4 KB
You can see that my memory usage has been reduced to 32.4 KB from 62.6 KB. So the datatype conversion reduced my memory consumption yet again.
Run df.head() and df.describe() to make sure nothing odd happened to your data after converting it.
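Another check worth doing when downcasting integers is whether the values actually fit in the smaller type. A quick sketch using NumPy's integer limits (my own addition; Volume is the column most likely to get close in this dataset):
import numpy as np

# int32 tops out at 2,147,483,647, so the maximum Volume needs to stay below that
print(np.iinfo('int32').max)
print(df['Volume'].max())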
But I want to apply this conversion as I bring the data in, so let's store the resulting datatype setup as a dictionary in the variable dt_setup and add the option dtype=dt_setup to my read_csv() call:
dt_setup = {
    'Ticker': 'category',
    'SimFinId': 'int32',
    'Open': 'float32',
    'High': 'float32',
    'Low': 'float32',
    'Close': 'float32',
    'Volume': 'int32'
}
df = pd.read_csv('us-shareprices-daily.csv', delimiter=';', nrows=1000, usecols=required_cols, dtype=dt_setup, parse_dates=[2])
Then I will parse the Date column using the parse_dates=[2] option (2 is the positional index of the Date column), so that I no longer need the following manual conversion:
for col in df.columns:
    if df[col].name == 'Date':
        df[col] = df[col].astype('datetime64[ns]')
Step 3: Pull data into a new CSV
Filter the data
I only want a year's worth of data for all stocks, so I will filter the data by date, temporarily making sure the Date column is a datetime type so that pandas can work with it.
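One thing to note before filtering: this step works on the whole file, not the 1,000-row sample, so the full CSV has to be re-read first (the info() output further down shows default dtypes, so my own re-read appears to have used plain defaults). If you want to keep the memory savings from Step 2, a sketch of that re-read could look like this:
# Re-read the FULL file (no nrows), reusing the column and dtype setup from Step 2
df = pd.read_csv('us-shareprices-daily.csv', delimiter=';',
                 usecols=required_cols, dtype=dt_setup, parse_dates=[2])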
# get data from year 2022
df['Date'] = pd.to_datetime(df['Date'])
sp_data_2022 = df[df['Date'].dt.year == 2022].reset_index()
sp_data_2022
My new dataframe is much smaller and contains just over 1.1 million records. It's still large, but now it's manageable. Here's the output of sp_data_2022.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1171074 entries, 0 to 1171073
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 1171074 non-null int64
1 Ticker 1171074 non-null object
2 SimFinId 1171074 non-null int64
3 Date 1171074 non-null datetime64[ns]
4 Open 1171074 non-null float64
5 High 1171074 non-null float64
6 Low 1171074 non-null float64
7 Close 1171074 non-null float64
8 Volume 1171074 non-null int64
dtypes: datetime64[ns](1), float64(4), int64(3), object(1)
memory usage: 80.4+ MB
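If even the optimized full read is too much memory for your machine, pandas can also stream the file in chunks and filter each chunk as it goes. A rough sketch of that alternative (my own addition; the 500,000-row chunk size is arbitrary):
# Read the CSV in 500,000-row chunks and keep only the 2022 rows from each one
chunks = pd.read_csv('us-shareprices-daily.csv', delimiter=';',
                     usecols=required_cols, dtype=dt_setup, parse_dates=[2],
                     chunksize=500_000)
sp_data_2022 = pd.concat(chunk[chunk['Date'].dt.year == 2022] for chunk in chunks)
sp_data_2022 = sp_data_2022.reset_index(drop=True)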
Prep dataframe for conversion
Although my dataframe is smaller than the original, I still need to change the Date column back to an object type before I convert my new dataframe into a CSV file.
from datetime import datetime

def convert_datetime(dt):
    # Change it back to the original format it came in
    return datetime.strftime(dt, '%Y-%m-%d')

sp_data_2022['Date'] = sp_data_2022['Date'].apply(convert_datetime)
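A shorter way to get the same result, if you prefer pandas' vectorized datetime accessor, is .dt.strftime with the same format string:
# Equivalent one-liner using the vectorized datetime accessor
sp_data_2022['Date'] = sp_data_2022['Date'].dt.strftime('%Y-%m-%d')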
Running info() again shows that the dataframe is even smaller, just from changing the Date column back to an object.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1171074 entries, 0 to 1171073
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 1171074 non-null int64
1 Ticker 1171074 non-null category
2 SimFinId 1171074 non-null int64
3 Date 1171074 non-null object
4 Open 1171074 non-null float32
5 High 1171074 non-null float32
6 Low 1171074 non-null float32
7 Close 1171074 non-null float32
8 Volume 1171074 non-null int64
dtypes: category(1), float32(4), int64(3), object(1)
memory usage: 56.0+ MB
So now I can convert my new, filtered, and cleaned data into a smaller CSV file. I will also exclude the index column by adding the option index=False.
sp_data_2022.to_csv('US_Share_Prices_2022.csv', index=False)
My resulting file is now a much more manageable size at 60 MB, plus I have a more current and clean dataset to work with.
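If you want to confirm that size from inside the notebook, a quick check with the standard library works (same filename as above):
import os

# Size of the exported CSV in MB
print(round(os.path.getsize('US_Share_Prices_2022.csv') / 1024**2, 1), 'MB')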
If you want to see exactly how I did the data cleanup above, feel free to view my Jupyter notebook in my GitHub repository.
Happy coding, friends!