Working on large data sets in Python

When working in IT security event monitoring, you will accumulate tons of log data which you will want to perform operations on. If you know Python, you will most likely know Pandas. When I describe Pandas to people who do not know it, I usually describe it as “Excel for Python”.

It’s easy to get started, but possibilities are endless. And it’s lightning fast. In 5 lines of code you can read a zipped file, search, delete, manipulate data, and save it as neatly formatted XLS.

The restrictions you will encounter therefore will not so much come from Pandas, but from your development environment: I recently needed to work with a 30GB CSV file and simply ran out of memory. But no worries, Pandas can handle big data and simply read large files in chunks.

While the script I was using is rather specific, I thought it might be worthwhile to share, because it is a useful collection of different techniques: Loading huge and compressed files, efficiently copying dataframe structures and combining dataframes, deleting empty cells and duplicate data.

The script itself was written to keep one exemplary log event for each Microsoft Windows Log Event ID from a large SIEM export containing several gigabytes of log data. With this export I always have a quick exemplary event on hand to write and test with my own Sigma rules.

The script is rather quick and dirty, but feel free to take the bits and pieces you need and hopefully it saves you some time from browsing StackExchange 😉

Purple Serendipity

Mais les braves gens n'aiment pas que l'on suive une autre route qu'eux

Working on large data sets in Python

Leave a Reply Cancel reply