So here we are in 2025, and there are still people using CSV as a data source. No, but “allô quoi”, as we used to say back in 2013!

If you are a developer, a data scientist, or just someone struggling with tons of “tabular” data, this article will save you hours of your life, because 10 GB of CSV in Excel, well, that crashes! Even your own Python script can struggle a bit if you have not done any particular optimization.

So what can you do?

Well, it is time to bring out the heavy artillery and turn to Xan, a tool developed in Rust and optimized to the bone, which lets you process, filter, and transform your tons of data in a few seconds where other tools instantly give up the ghost. Just imagine: what Pandas does in 30 seconds, Xan does in 3 seconds, all while consuming 10 times less memory!

The xan view command turns your terminal into an elegant and functional data interface.

Of course, it all happens in the terminal, on macOS, Windows and Linux, and like other tools in the same vein such as OctoSQL and Miller, it will serve you well. Because yes, Xan, which was developed by the médialab of Sciences Po, is a fork of xsv, completely redesigned for the specific needs of social sciences and web data analysis.

Yes, it is a correlation graph generated directly in your terminal. Magic!

Here are the main commands of the tool:

  • xan view : data preview in the terminal
  • xan filter : row filtering according to a condition
  • xan map : creation of new columns
  • xan sort : sorting of data
  • xan join : joining CSV files
  • xan stats : descriptive statistics
  • xan frequency : frequency tables
  • xan hist : histograms in the terminal
  • xan dedup : deduplicating a file
  • xan transform : modifying text (for example, converting certain values to lower case…)
  • xan fill : filling the places in the CSV where values are missing, with zero or something else
  • … etc. In all, more than 50 commands covering almost every need related to CSVs and data analysis.

Analyze time series directly in your terminal, without a single line of Matplotlib

The appeal, as I told you, is that Xan can handle huge files with little memory, thanks in particular to smart automatic parallelization of the processing. To give you an idea, where a standard pandas script would consume 4 GB of RAM to process a 1 GB CSV file, Xan can accomplish the same task with only 100 MB of memory.
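A low memory footprint is what you would expect from any tool that streams rows instead of loading the whole file first. As a rough illustration in Python (this is not how Xan is implemented; the helper name `stream_filter` and the `count` column are made up for the example), a row-by-row filter only ever holds the current row in memory:

```python
import csv
import io


def stream_filter(src, dst, column, threshold):
    """Copy rows whose `column` value exceeds `threshold`, one row at a time.

    Hypothetical helper for illustration; returns the number of rows kept.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:  # only the current row is held in memory
        if float(row[column]) > threshold:
            writer.writerow(row)
            kept += 1
    return kept


# Tiny in-memory demo; with real files, pass open file handles instead.
source = io.StringIO("name,count\na,5\nb,12\nc,30\n")
target = io.StringIO()
kept_rows = stream_filter(source, target, "count", 10)
print(kept_rows)
print(target.getvalue())
```

The same loop works unchanged on a multi-gigabyte file, since memory use depends on the widest row rather than on the file size.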

To install it, nothing could be simpler. There are install commands for every OS, but the best option on PC (Linux, Windows) is still:

Or under macOS:

A heatmap worthy of a D3.js visualization, but generated entirely in your terminal

Xan also has its own expression language, called Moonblade (named after the magic sword of Xan in Baldur’s Gate, connoisseurs will recognize it…). Its syntax sits halfway between Python and JavaScript and makes it easy to process CSV data.

Here are some concrete examples:

  • Filtering: xan filter 'count > 10' data.csv
  • Transformation: xan transform name 'upper(name)' data.csv
  • Calculation: xan map 'tweet_count / retweet_count' ratio data.csv
  • Aggregation: xan agg 'sum(retweet_count), mean(retweet_count)' data.csv
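To make clear what each of these four commands is meant to compute, here is a rough plain-Python equivalent on a three-row table. The sample data is invented, and the sketch only mirrors the intent of the expressions, not Xan's actual behavior on edge cases such as empty cells or division by zero:

```python
import csv
import io
from statistics import mean

# Invented sample data, only for illustration.
SAMPLE = """name,count,tweet_count,retweet_count
alice,4,10,5
bob,12,30,10
carol,25,8,4
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Filtering: keep rows where count > 10
filtered = [r for r in rows if int(r["count"]) > 10]

# Transformation: upper-case the name column
transformed = [{**r, "name": r["name"].upper()} for r in rows]

# Calculation: add a "ratio" column
mapped = [
    {**r, "ratio": int(r["tweet_count"]) / int(r["retweet_count"])} for r in rows
]

# Aggregation: sum and mean of retweet_count
retweets = [int(r["retweet_count"]) for r in rows]
aggregated = {"sum": sum(retweets), "mean": mean(retweets)}

print([r["name"] for r in filtered])
print([r["name"] for r in transformed])
print([r["ratio"] for r in mapped])
print(aggregated)
```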

“Small multiples”, as Tufte would say, generated in the blink of an eye in your terminal

This lets you drive Xan from your own programs, without having to install a library. And it’s just as optimized! There are also functions for handling dates, strings, URLs, etc. By chaining commands with UNIX pipes, you will be the master of the world!
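Driving Xan from a program comes down to spawning processes and connecting them with pipes. Here is a minimal Python sketch, assuming the `xan` binary is on your PATH and reusing the `frequency -s … | hist` chain from this article; the helper names `build_pipeline` and `run_pipeline` are made up for the example:

```python
import os
import shutil
import subprocess


def build_pipeline(path, column):
    """Return the argv lists for: xan frequency -s COLUMN PATH | xan hist."""
    return [
        ["xan", "frequency", "-s", column, path],
        ["xan", "hist"],
    ]


def run_pipeline(commands):
    """Chain the commands with pipes and return the last command's stdout."""
    previous = None
    processes = []
    for argv in commands:
        proc = subprocess.Popen(
            argv,
            stdin=previous.stdout if previous else None,
            stdout=subprocess.PIPE,
            text=True,
        )
        if previous:
            # Close our copy so the upstream process sees a closed pipe
            # if the downstream one exits early.
            previous.stdout.close()
        processes.append(proc)
        previous = proc
    output, _ = processes[-1].communicate()
    for proc in processes[:-1]:
        proc.wait()
    return output


commands = build_pipeline("medias.csv", "edito")
if shutil.which("xan") and os.path.exists("medias.csv"):
    # Only run when xan is installed and the sample file is present.
    print(run_pipeline(commands))
else:
    print("would run:", " | ".join(" ".join(c) for c in commands))
```

Passing argument lists instead of a shell string avoids quoting problems when an expression contains spaces or quotes.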

To give you a concrete example, here is how to quickly analyze a media database:

# Download a sample dataset
curl -LO https://github.com/medialab/corpora/raw/master/polarisation/medias.csv

# Quickly explore the file
xan headers medias.csv
xan count medias.csv
xan view medias.csv

# A few basic analyses
xan stats -s indegree,foundation_year medias.csv
xan frequency -s edito medias.csv | xan hist

# Filter and transform
xan filter 'foundation_year > 2000' medias.csv > recents.csv
xan map 'fmt("{} ({})", name, foundation_year)' display_name recents.csv > result.csv

In short, another great tool that will save you serious time! Use it in your Python scripts, for example, or integrate it into your data processing pipelines; you will not be disappointed! Even the documentation is amazing!

