Working on large data sets in Python

When working in IT security event monitoring, you will accumulate tons of log data which you will want to perform operations on. If you know Python, you will most likely know Pandas. When I describe Pandas to people who do not know it, I usually describe it as “Excel for Python”.

It’s easy to get started, but the possibilities are endless. And it’s lightning fast: in five lines of code you can read a zipped file, search, delete, and manipulate data, and save the result as a neatly formatted Excel file.

The restrictions you encounter will therefore not come from Pandas so much as from your development environment: I recently needed to work with a 30 GB CSV file and simply ran out of memory. But no worries, Pandas can handle big data and simply reads large files in chunks.

While the script I was using is rather specific, I thought it might be worthwhile to share, because it is a useful collection of different techniques: Loading huge and compressed files, efficiently copying dataframe structures and combining dataframes, deleting empty cells and duplicate data.

The script itself was written to keep one exemplary log event for each Microsoft Windows Log Event ID from a large SIEM export containing several gigabytes of log data. With this export I always have a quick exemplary event on hand to write and test with my own Sigma rules.

The script is rather quick and dirty, but feel free to take the bits and pieces you need, and hopefully it saves you some time otherwise spent browsing StackExchange 😉

# coding: utf-8

__author__ = 'Fabian Voith'
__version__ = '1.0.0'
__email__ = ''

import pandas as pd

# The CSV file (can be in zip file) MUST have the columns "Event Name", "Event ID (custom)", and "Unnamed: 9" (containing the Payload).
# Apart from that, the following structure is recommended:
#['Event Name',
# 'Low Level Category',
# 'Source IP',
# 'Source Port',
# 'Destination IP',
# 'Destination Port',
# 'Username',
# 'Event ID (custom)',
# 'AgentLogFile (custom)',
# 'Unnamed: 9',
# 'Unnamed: 10',
# 'Unnamed: 11']

filename = r'*Insert your filename to be processed here; can also be zipped, e.g. C:\*'

processed_ids = []

first_run = True
chunk_counter = 0

chunksize = 10 ** 4  # number of rows per chunk
for chunk in pd.read_csv(filename, chunksize=chunksize, sep=','):

    chunk_counter += 1
    print('Working on chunk ' + str(chunk_counter))
    # Rename columns
    chunk.rename(columns={'Event ID (custom)': 'EventID', 'Unnamed: 9': 'Payload',
                          'Event Name': 'EventName'}, inplace=True)
    # Drop the unused trailing columns
    # axis=1 is for columns, axis=0 for rows
    chunk.drop(['Unnamed: 10', 'Unnamed: 11'], axis=1, inplace=True)
    # Drop completely empty rows
    chunk.dropna(how='all', inplace=True)
    # Keep only one event per EventID within this chunk
    chunk = chunk.drop_duplicates('EventID', keep='last')
    if first_run:
        # create a new, empty DataFrame that copies the column structure of the chunks
        new_df = chunk.iloc[0:0].copy()
        first_run = False
    # combine this chunk with all EventIDs collected so far
    new_df = pd.concat([chunk, new_df])

if chunk_counter > 1:
    # if we read more than one chunk, just removing duplicates from the chunk might still result in duplicates
    # because we are merging several chunks to a new DF. So each chunk is free of duplicates, but among the chunks
    # there might be identical Event IDs, so we need to remove duplicates from the new DF as well:
    new_df = new_df.drop_duplicates('EventID', keep='last')
new_df = new_df.sort_values(['EventID'], ascending=[True])

print('Writing file...')
# Output path is an example; adjust as needed. to_csv keeps the script dependency-free;
# use to_excel instead if you want an Excel file (requires openpyxl).
new_df.to_csv('unique_event_ids.csv', index=False)

print('Done. Wrote ' + str(len(new_df)) + ' lines.')
