How to Quickly Extract Data from Big Datasets in 3 Steps

Photo by fabio on Unsplash


So I wanted to do a data science project based on a similar one I found, but with a more recent dataset rather than a random, out-of-date one pulled from an API that hasn't been updated in a while.

I wanted to do the data extraction and the cleanup on my own, because I like to make things challenging.

Plus, it would give me the experience of what happens in the real world, since you never get bloated files with missing values. 😜 Just kidding.

I found such a dataset from SimFin, which contains stock data for all US-traded companies from the past 5 or so years. It is a massive CSV file with over 5.3 million rows that sits at 354 MB in size! 🙀

In its current state, I can't do any analysis on it without it blowing up my laptop, and I can't load it anywhere (e.g. Colab, GitHub) either. The goal is to extract a year's worth of data on all stocks traded in the US.

So I have my work cut out for me. 👷‍♀️

Step 1: Load the dataset using nrows

I did a bulk download of the CSV from SimFin and loaded it into my Jupyter notebook, just to see how big and how long it would take. If you don't specify the number of rows to load, pandas will load them ALL and eat up your system's memory - in this case, it is using 1013.4 MB and it is so slow!

import pandas as pd

%%time
# Pull in all the data and time it to see how long it takes
df = pd.read_csv('us-shareprices-daily.csv', delimiter=';')
len(df)
CPU times: user 1.95 s, sys: 657 ms, total: 2.61 s
Wall time: 2.85 s
5322568

And it took almost 3 seconds just to load!

Now, let's look at the dataframe's info.

df.info(verbose=True, memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5322568 entries, 0 to 5322567
Data columns (total 11 columns):
 #   Column              Dtype  
---  ------              -----  
 0   Ticker              object 
 1   SimFinId            int64  
 2   Date                object 
 3   Open                float64
 4   High                float64
 5   Low                 float64
 6   Close               float64
 7   Adj. Close          float64
 8   Volume              int64  
 9   Dividend            float64
 10  Shares Outstanding  float64
dtypes: float64(7), int64(2), object(2)
memory usage: 1013.4 MB

This is using a ton of memory, so let's fiddle with some things.

Let's see which columns have missing data in the original dataset, and whether I even need those columns, by running df.isnull().sum()

print(df.isnull().sum())
Ticker                      0
SimFinId                    0
Date                        0
Open                        0
High                        0
Low                         0
Close                       0
Adj. Close                  0
Volume                      0
Dividend              5288387
Shares Outstanding     380181
dtype: int64

Now that I have a clearer picture of what is in the dataset, I can work on cleaning it.

Using nrows

If you use the nrows option in the read_csv function, you can dramatically decrease memory usage while you clean up your data, since you are working with a much smaller slice of the dataset. In this case, I only want to bring in 1,000 rows so I can do some cleanup and get a general feel for which datatypes I need to convert to.

# Pull in only 1000 records
df = pd.read_csv('us-shareprices-daily.csv', delimiter=';', nrows=1000)

# Get summary information about the DataFrame df
df.info(verbose=False, memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 11 entries, Ticker to Shares Outstanding
dtypes: float64(6), int64(3), object(2)
memory usage: 192.5 KB

In this case, bringing in only 1,000 records to work with has dramatically decreased the memory usage, from 1013.4 MB down to 192.5 KB! 🤯

Step 2: Clean up columns

Edit or remove columns

Extract only the columns you need from the dataframe and figure out which datatypes best fit the data you are bringing in. So let's look at which columns are in my dataset.

# extract column names
cols = df.columns.values
cols

Here's the column list for my dataset.

array(['Ticker', 'SimFinId', 'Date', 'Open', 'High', 'Low', 'Close',
       'Adj. Close', 'Volume', 'Dividend', 'Shares Outstanding'],
      dtype=object)

In my case, I don't need the Shares Outstanding, Adj. Close, or Dividend columns for my analysis, and two of them have a lot of missing data anyway, so that's three birds with one stone.

Now, I will define the list of columns that I do need and store it in a variable, required_cols.

required_cols = ['Ticker', 'SimFinId', 'Date', 'Open', 'High', 'Low', 'Close', 'Volume']

To load only these columns, I move the required_cols definition above the read_csv call and add the option usecols=required_cols to it.

import pandas as pd

required_cols = ['Ticker', 'SimFinId', 'Date', 'Open', 'High', 'Low', 'Close', 'Volume']

df = pd.read_csv('us-shareprices-daily.csv', delimiter=';', nrows=1000, usecols=required_cols)

Load the dataset again with only the required columns and then rerun df.info(memory_usage='deep'). The dataframe now only has the columns I specified above.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   Ticker    1000 non-null   category
 1   SimFinId  1000 non-null   int64   
 2   Date      1000 non-null   object  
 3   Open      1000 non-null   float32 
 4   High      1000 non-null   float32 
 5   Low       1000 non-null   float32 
 6   Close     1000 non-null   float32 
 7   Volume    1000 non-null   int64   
dtypes: category(1), float32(4), int64(2), object(1)
memory usage: 97.8 KB

Now my memory usage is down to 97.8 KB from 192.5 KB!

Convert the column datatypes

Different data types require varying amounts of memory, so by converting columns to more memory-efficient data types, I can significantly reduce the memory footprint of my dataset when I do any analysis.
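
Before converting anything, it helps to see where the memory is actually going, and pandas can break usage down per column. This is an illustrative check (not from the original notebook) you could run on the sampled dataframe:

# Per-column memory usage in bytes; deep=True also counts the strings inside object columns
df.memory_usage(deep=True)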

Run describe() on your data to see its summary statistics; this gives you a better idea of which data types will actually work best and helps you spot anything that breaks when you convert from one datatype to another.

If you use .info() on your dataset, you can get column and datatype information:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Ticker    1000 non-null   object 
 1   SimFinId  1000 non-null   int64  
 2   Date      1000 non-null   object 
 3   Open      1000 non-null   float64
 4   High      1000 non-null   float64
 5   Low       1000 non-null   float64
 6   Close     1000 non-null   float64
 7   Volume    1000 non-null   int64  
dtypes: float64(4), int64(2), object(2)
memory usage: 62.6+ KB

Let's have my dataset use memory-efficient data types. Instead of float64 and int64, I will use float32 and int32, so I don't lose any meaningful precision after the conversion.
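
To be comfortable that float32 is precise enough for these price columns, you can measure the largest change a round-trip conversion would introduce. An illustrative check, not from the original notebook:

# Largest absolute error introduced by converting Close to float32 and back
print((df['Close'] - df['Close'].astype('float32').astype('float64')).abs().max())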

Depending on the data type, the conversion loop shown below performs one of the following conversions:

  • If the column has a data type of float64, it converts the column to the more memory-efficient data type float32 or float16.

  • If the column has a data type of int64, it converts the column to the more memory-efficient data type int32 or int16.

  • If the column has a data type of object, it converts the column to the category data type. This can be useful for categorical data with a limited number of unique values and can save memory; see the quick check after this list.
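
A category dtype only pays off when the number of unique values is small relative to the number of rows, which is easy to confirm. An illustrative check, not from the original notebook:

# Tickers repeat across many daily rows, so category is a good fit
print(df['Ticker'].nunique(), 'unique tickers across', len(df), 'rows')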

So in my case, I decided that Ticker should be a category instead of an object, that my float values should be float32, and that all the integers should be int32. You will have to tweak these datatypes for your own needs. Here's my final code and output; to convert an existing dataframe, I would run this code:

# Downcast each column to a more memory-efficient dtype
for col in df.columns:
    if df[col].dtype == 'float64':
        df[col] = df[col].astype('float32')  # half the memory per float
    if df[col].dtype == 'int64':
        df[col] = df[col].astype('int32')    # half the memory per integer
    if df[col].dtype == 'object' and df[col].name == 'Ticker':
        df[col] = df[col].astype('category') # few unique tickers, many rows
    if df[col].name == 'Date':
        df[col] = df[col].astype('datetime64[ns]')  # parse the date strings

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Ticker    1000 non-null   category      
 1   SimFinId  1000 non-null   int32         
 2   Date      1000 non-null   datetime64[ns]
 3   Open      1000 non-null   float32       
 4   High      1000 non-null   float32       
 5   Low       1000 non-null   float32       
 6   Close     1000 non-null   float32       
 7   Volume    1000 non-null   int32         
dtypes: category(1), datetime64[ns](1), float32(4), int32(2)
memory usage: 32.4 KB

You can see that my memory usage has been reduced to 32.4 KB from 62.6 KB. So the datatype conversion reduced my memory consumption yet again.

Run df.head() and df.describe() to make sure nothing odd happened to your data after converting it.
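
One concrete thing worth checking after the datetime conversion is the date range, to confirm the strings parsed as expected. A quick illustrative check:

# The min and max should match the date span of the rows you loaded
print(df['Date'].min(), '->', df['Date'].max())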

But I want to apply this conversion when I bring the data in, so let's define the datatype setup as a dictionary stored in a variable, dt_setup, and add the option dtype=dt_setup to my read_csv():

dt_setup = {
    'Ticker':'category',
    'SimFinId':'int32',
    'Open':'float32',
    'High':'float32',
    'Low':'float32',
    'Close':'float32',
    'Volume':'int32'
}

df = pd.read_csv('us-shareprices-daily.csv', delimiter=';', nrows=1000, usecols=required_cols, dtype=dt_setup, parse_dates=[2])

I also parse the Date column with the parse_dates=[index] option (index 2 is the Date column), so that I no longer need the following manual conversion:

for col in df.columns:
    if df[col].name == 'Date':
        df[col] = df[col].astype('datetime64[ns]')

Step 3: Pull data into a new CSV

Filter the data

I only want a year's worth of data for all stocks, so I will reload the full dataset as sketched below, then filter the rows I need by date, making sure the Date column is a datetime type so pandas can filter on the year.
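
Since the extraction needs every row rather than the 1,000-row sample, I drop the nrows option and reuse the same column and datatype setup. A sketch of that full reload, assuming the required_cols and dt_setup variables defined earlier:

# Reload all 5.3 million rows, keeping only the required columns and efficient dtypes
df = pd.read_csv('us-shareprices-daily.csv', delimiter=';', usecols=required_cols, dtype=dt_setup, parse_dates=['Date'])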

# Make sure Date is a datetime (a no-op if parse_dates already handled it)
df['Date'] = pd.to_datetime(df['Date'])
# Get data from year 2022
sp_data_2022 = df[df['Date'].dt.year == 2022].reset_index()
sp_data_2022

My new dataframe is much smaller and contains just over 1.1 million records, as sp_data_2022.info() shows. It's still large, but now it's manageable.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1171074 entries, 0 to 1171073
Data columns (total 9 columns):
 #   Column    Non-Null Count    Dtype         
---  ------    --------------    -----         
 0   index     1171074 non-null  int64         
 1   Ticker    1171074 non-null  object        
 2   SimFinId  1171074 non-null  int64         
 3   Date      1171074 non-null  datetime64[ns]
 4   Open      1171074 non-null  float64       
 5   High      1171074 non-null  float64       
 6   Low       1171074 non-null  float64       
 7   Close     1171074 non-null  float64       
 8   Volume    1171074 non-null  int64         
dtypes: datetime64[ns](1), float64(4), int64(3), object(1)
memory usage: 80.4+ MB

Prep dataframe for conversion

Although my dataframe is smaller than the original, I still need to change the Date column back to an object type before I convert my new dataframe into a CSV file.

from datetime import datetime

def convert_datetime(dt):
    return datetime.strftime(dt, '%Y-%m-%d')  # change it back to the original format it came in

sp_data_2022['Date'] = sp_data_2022['Date'].apply(convert_datetime)
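
A vectorized alternative that skips the per-row apply, assuming the Date column is still a datetime at this point:

# Same string format without calling a Python function for every row
sp_data_2022['Date'] = sp_data_2022['Date'].dt.strftime('%Y-%m-%d')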

Running info() again shows the Date column back as an object and, with the memory-efficient datatypes in place, the dataframe down to around 56 MB.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1171074 entries, 0 to 1171073
Data columns (total 9 columns):
 #   Column    Non-Null Count    Dtype   
---  ------    --------------    -----   
 0   index     1171074 non-null  int64   
 1   Ticker    1171074 non-null  category
 2   SimFinId  1171074 non-null  int64   
 3   Date      1171074 non-null  object  
 4   Open      1171074 non-null  float32 
 5   High      1171074 non-null  float32 
 6   Low       1171074 non-null  float32 
 7   Close     1171074 non-null  float32 
 8   Volume    1171074 non-null  int64   
dtypes: category(1), float32(4), int64(3), object(1)
memory usage: 56.0+ MB

So now I can convert my new, filtered, and cleaned data into a smaller CSV file. I will also exclude the dataframe's index by adding the option index=False. (Note that the leftover index column created by reset_index() still gets written; use reset_index(drop=True) when filtering if you don't want it.)

sp_data_2022.to_csv('US_Share_Prices_2022.csv', index=False)
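
As a final sanity check, you can read the new file back in and confirm the row count matches the filtered dataframe. An illustrative check, assuming the file name above:

# The exported CSV should have one row per record in sp_data_2022
check = pd.read_csv('US_Share_Prices_2022.csv')
print(len(check) == len(sp_data_2022))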

My resulting file is now a much more manageable size at 60 MB, plus I have a more current and clean dataset to work with.

If you want to see exactly how I did the data cleanup above, feel free to view my Jupyter notebook in my GitHub repository.

Happy coding, friends!
