
Analyze log files with SQL using Superintendent.app

I launched Superintendent.app 6 months ago. Initially, I built it to make myself more productive at work since I work with a lot of CSV files, and I loathe using Excel. After building it, I’ve realized that having a Desktop app that can handle GBs of CSV files is really versatile. I can analyze CSV files from customers, cross-check (i.e. join) them with internal data, reformat them in various ways, and more.

Analyzing log files is another interesting use case. I’m sure there are tons of tools that can do the same thing, though I’m not sure how many of them are backed by a real database. Superintendent.app is backed by SQLite, a real database, which results in a different set of strengths and weaknesses, e.g. a medium initial load time (10s for a 1GB file), quick query time, and powerful functionality.

To recap, Superintendent.app is an offline Desktop app that allows you to import CSV/TSV files (and several other tabular formats) and write SQL on those files. Here’s what the app looks like:

As a Desktop app, Superintendent.app can handle GBs of files easily. Moreover, the privacy and security of an offline Desktop app yield better peace of mind.

Superintendent.app also enables you to cross-check (i.e. join) records between the log file and a database table (exported in CSV).
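For example, if you exported a database table to a hypothetical sources.csv that maps each source_id to an owning team, the cross-check could look like the sketch below. All table and column names here are illustrative:

SELECT l.received_at, l.message, s.team
FROM logs l
JOIN sources s ON l.source_id = s.source_id
WHERE s.team = 'infra'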

Since Superintendent.app is backed by SQLite (which is powerful but still quite limited in SQL capabilities compared to other SQL databases), we’ve added 3 convenient SQL functions to make our lives easier, namely: date_parse, regex_extract, and regex_replace.
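As a rough sketch of how these functions read (regex_extract and regex_replace follow their usage in the examples later in this post, while the date_parse arguments are my assumption for illustration; the table name is also illustrative):

SELECT
  date_parse(generated_at, 'yyyy-MM-dd HH:mm:ss') AS generated_ts, -- assumed signature: input, then a format string
  regex_extract('user "([^"]+)"', message) AS pg_user,             -- returns the captured group, as used later in this post
  regex_replace('[0-9]+', message, '0', false) AS squeezed_digits  -- same argument order as the examples below
FROM logs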

Because I’m a big fan of papertrail.com, I’d like to follow their example of reading log files: https://www.papertrail.com/help/permanent-log-archives

Instead of using the command line, I’ll use Superintendent.app to perform the same tasks.

Setup

I’ll use the log files from https://superintendent.app as an example. The schema of a log file looks like this:

id
generated_at
received_at
source_id
source_name
source_ip
facility_name
severity_name
program
message

We’ll need to add the header row as the first row manually since a log file doesn’t have one.

After adding the header row, the file looks like below:

id  generated_at  received_at source_id source_name source_ip facility_name severity_name program message
1400180097685377024 2021-11-24 15:02:51 2021-11-24 15:02:51 8113625781  superintendent-prod 3.310.241.96  Local0  Error app/postgres.1262692  [DATABASE] [9-1]  sql_error_code = 28000 FATAL:  no pg_hba.conf entry for host "48.136.164.110", user "postgres", database "postgres", SSL off
1400180234126229518 2021-11-24 14:59:22 2021-11-24 15:03:24 8113625781  superintendent-prod 10.215.223.49 Local0  Info  app/heroku-postgres source=DATABASE addon=postgresql-spherical-57847 sample#current_transaction=1879 sample#db_size=10678831bytes sample#tables=3 sample#active-connections=22 sample#waiting-connections=0 sample#index-cache-hit-rate=0.99313 sample#table-cache-hit-rate=0.98474 sample#load-avg-1m=0 sample#load-avg-5m=0 sample#load-avg-15m=0 sample#read-iops=0 sample#write-iops=0.074205 sample#tmp-disk-used=542765056 sample#tmp-disk-available=72436027392 sample#memory-total=4023192kB sample#memory-free=178744kB sample#memory-cached=3279752kB sample#memory-postgres=53508kB sample#wal-percentage-used=0.06477739851760235
1400181318660173834 2021-11-24 15:06:27 2021-11-24 15:07:42 8113625781  superintendent-prod 56.92.230.219 Local0  Info  app/heroku-postgres source=DATABASE addon=postgresql-spherical-57847 sample#current_transaction=1879 sample#db_size=10678831bytes sample#tables=3 sample#active-connections=22 sample#waiting-connections=0 sample#index-cache-hit-rate=0.99313 sample#table-cache-hit-rate=0.98474 sample#load-avg-1m=0 sample#load-avg-5m=0 sample#load-avg-15m=0 sample#read-iops=0 sample#write-iops=0.097046 sample#tmp-disk-used=542765056 sample#tmp-disk-available=72436027392 sample#memory-total=4023192kB sample#memory-free=177796kB sample#memory-cached=3279804kB sample#memory-postgres=53508kB sample#wal-percentage-used=0.06477739851760235
...more rows...

Loading the TSV into Superintendent.app yields the screenshot below:

Initial load

Now we can follow the usage examples explained here: https://www.papertrail.com/help/permanent-log-archives

Usage examples

Show identical messages

Papertrail suggests the command below:

gzip -cd 2021-11-24-15.tsv.gz | cut -f10 | sort | uniq -c | sort -n

An equivalent SQL is:

SELECT COUNT(*), message FROM "2021_11_24_15" GROUP BY message ORDER BY COUNT(*) DESC

It would yield a result as shown below:

Identify identical messages

Show similar messages

Papertrail suggests the command below:

$ gzip -cd *.tsv.gz  | # extract all archives
    cut -f 5,9-      | # sender, program, message
    tr -s '\t' ' '   | # squeeze whitespace
    tr -s 0-9 0      | # squeeze digits
    cut -d' ' -f 1-8 | # truncate after eight words
    sort | uniq -c | sort -n

An equivalent SQL is:

WITH sanitized AS (
  SELECT
    REGEX_EXTRACT(
      '^(([^\s]+ ){0,8})', -- Get the first 8 words
      REGEX_REPLACE('[0-9]+', REGEX_REPLACE('\s+', message, ' ', false), '0', false) -- Squeeze whitespace and digits
    ) AS message,
    REGEX_REPLACE('[0-9]+', REGEX_REPLACE('\s+', program, ' ', false), '0', false) AS program,
    REGEX_REPLACE('[0-9]+', REGEX_REPLACE('\s+', source_name, ' ', false), '0', false) AS source_name
  FROM "2021_11_24_15"
)

SELECT COUNT(*) AS occurrence, message FROM sanitized GROUP BY message ORDER BY COUNT(*) DESC

The above SQL yields the result below:

Identify similar messages

I wish SQLite supported defining a user-defined function within a SQL statement; the code would be cleaner. But this is an improvement for another day.

Searching

Searching is the simplest task because you can use column LIKE '%something%' or regex_extract('regex', column) IS NOT NULL to search for a specific string.
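As a minimal sketch, both forms, searching for the pg_hba.conf error from the sample rows above, look like this:

SELECT * FROM "2021_11_24_15" WHERE message LIKE '%pg_hba.conf%'

SELECT * FROM "2021_11_24_15" WHERE regex_extract('no pg_hba\.conf entry', message) IS NOT NULL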

For a more involved example, if you want to find the log lines where write-iops is greater than or equal to 0.087, you can use the SQL below:

SELECT
  CAST(regex_extract('write-iops=([0-9\.]+)', message) AS decimal) AS write_iops,
  regex_extract('(.{0,30}write-iops=[0-9\.]+.{0,30})', message) AS context, -- Show context
  *
FROM "2021_11_24_15"
WHERE write_iops >= 0.087

The SQL yields the result as shown below:

Search

Beautiful, isn’t it?

Wrapping up

If you are familiar with SQL, using SQL to analyze log files makes a lot of sense. You can visit https://superintendent.app to try it out.

Thank you for reading until the end. I’ll see you next time.

Use SQL instead of Spreadsheet/Excel

Since starting at a new company, I’ve been working on accounting reports that come in the form of CSVs, so I’ve been working with a lot of CSV files.

As a person who knows SQL well but isn’t good at Excel, I find working with CSV files in Excel to be a nightmare. Every time I did something, I felt that I would have spent 10x less time if I could have used SQL.

VLOOKUP and pivot tables are genuinely difficult to use. It’s not just writing stuff; it’s a combination of coding and meddling with the UI.

SQL is much more ergonomic. It’s just a piece of text. There are a few benefits to being “a piece of text”:

  1. You can save the SQL in a file and reuse it on different CSV files that have the same format.
  2. It’s easier to check for correctness because the whole logic is contained within a single SQL.

Besides the above, there’s a hard blocker: Excel cannot load a CSV with more than 1M rows. Using SQL with an actual database avoids this limit.

In this blog post, I’d like to show how VLOOKUP and pivot tables can be replaced with SQL, and hopefully this will inspire you to learn the basics of SQL.

VLOOKUP

VLOOKUP is somewhat equivalent to JOIN in SQL.

Let’s say there are 2 CSV files, names and salaries, that are associated with each other through employee_id.

employee_id name
1 John
2 Jane
3 Mary

and

employee_id salary
1 70000
2 80000
3 60000

We can write a SQL statement that constructs a table containing name and salary on the same row:

SELECT 
  n.employee_id, 
  n.name,  
  s.salary 
FROM names n 
JOIN salaries s 
ON n.employee_id = s.employee_id

The above SQL yields:

employee_id name salary
1 John 70000
2 Jane 80000
3 Mary 60000

You can see that it’s just a short SQL to achieve what VLOOKUP can do.
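One difference worth noting: VLOOKUP yields #N/A when it finds no match, whereas the inner JOIN above drops the row entirely. To keep every employee even when no salary row matches, use a LEFT JOIN:

SELECT
  n.employee_id,
  n.name,
  s.salary -- NULL when salaries has no matching row
FROM names n
LEFT JOIN salaries s
ON n.employee_id = s.employee_id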

Pivot tables

Pivot tables are somewhat equivalent to GROUP BY.

Let’s say there’s one CSV that contains sales per location, and you want to get total sales per location. Here’s the CSV:

state price
CA 10
CA 12
WA 17
WA 1
NY 9

We can write a SQL statement that groups by state and computes the total sales of each state:

SELECT 
  state, 
  SUM(price) as price 
FROM sales 
GROUP BY state

The above SQL yields:

state price
CA 22
WA 18
NY 9

If you are familiar with SQL, this would take less than a minute to write.
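Pivot tables also commonly show counts and averages; those map directly onto additional aggregate functions in the same query. Here’s a sketch using the same sales table:

SELECT
  state,
  COUNT(*) AS num_sales, -- number of sales per state
  SUM(price) AS total,   -- total sales per state
  AVG(price) AS average  -- average sale price per state
FROM sales
GROUP BY state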

What next?

Since you’ve read up to this point, I know the question you’re asking now is: “OK, I’m convinced. But where can I write SQL on CSVs?”

I had asked the very same question, and there are many options out there.

However, I ended up building my own Desktop application for this purpose: Superintendent.app, which supports Windows, Mac, and Linux.

I intend for Superintendent.app to be an SQL-focused spreadsheet Desktop app that offers a great UX like Excel does (e.g. sorting columns, summing values of selected cells). It’s not there today, but we are moving in that direction.

Since Superintendent.app is backed by SQLite, it is very fast (e.g. it loads a 1GB file within 10s on a Macbook Pro 2020) and can handle a dataset as large as your memory allows.

While I want you to try Superintendent.app, the main gist is to show how SQL can significantly improve your workflow, and hopefully you’re inspired to start learning SQL.

Thank you for reading until the end, and I’ll see you next time :)

A programmable tooltip on Mac OS

Several months back, I started a new job and immediately found myself performing repetitive micro-workflows, like converting seconds from epoch to a human-readable time or opening a file from an error stacktrace in IntelliJ.

I’d do these workflows many times a day. It was tiresome, and performing these flows quickly required quite a bit of (motor) skill.

So, I’ve decided to build Tip to perform these workflows faster.

Tip is built upon “Services” (or “System-wide Services”), a Mac OS mechanism that allows one application to send selected text to another application.

With this mechanism, Tip can be used with any Mac OS app!

Tip also only sees the text when the user explicitly triggers it by hitting the configured shortcut.

Tip gets the selected text, calls the user script, and renders the result in a tooltip as shown below:

Convert seconds from epoch to time and copy

Then, you can select an item to perform the defined action. Currently, Tip supports 2 actions: (1) copying to the clipboard and (2) opening a URL, which can be used to trigger any Mac app that supports URLs, e.g. opening a file in IntelliJ.

Since you write the user script yourself, you can customize Tip to do anything you want.

My favourite technical detail of Tip is that it runs in the App Sandbox without requesting additional permissions. Tip can only execute (and read) a user script placed in ~/Library/Application\ Scripts/tanin.tip/. Privacy-wise, this is a huge win.

I’ve been using Tip for 3-4 months now with many use cases. Surprisingly, the use case I use most frequently is selecting a file path in an error stacktrace and opening that file in IntelliJ/Github. The user script uses the selected text to search through local files and offers tooltip items that open the matching files in IntelliJ/Github. It looks like this:

Open file on Github from an error stacktrace line

I’d love for you to try Tip, and I hope it makes you move faster: https://github.com/tanin47/tip

How I take notes in the workplace

I started taking notes in the workplace a few years back. What nudged me to start was that I occasionally forgot my action items after a meeting; I felt forgetful. Since then, I’ve gone through different processes of note taking, and today I want to share what my current process looks like.

I carry a small physical notebook with me everywhere. There are 3 categories that I always note: (1) to-do items, (2) questions, and (3) notable info. Then, later on, I transfer these notes into a digital form.

My notebook with a pen

My mini Rocketbook with a 4-color Frixion pen. The notebook is erasable with a wet towel.

Noting to-do items is the most important action. When someone asks me to do something, I never want to forget it. Noting the deadline is just as important. I write it down.

Noting questions makes me more informed. If a question pops up in my mind, I note it immediately, no matter how dumb the question seems. I don’t even think about answering it. I write it down.

Though notable info sounds vague, I have a low threshold for it. If something surprises me or I didn’t know about it before, I write it down.

Here you might notice the theme: “just write it down”. I write it down without thinking. When I write down a to-do item, I don’t think about what it entails. When I write down a question, I don’t think about the answer. When I write down notable info, I don’t label it as good or bad. I want to avoid forming an impression prematurely. Writing it down without judgement frees my mind from the worry (that I might forget) and allows me to focus on the conversation at hand.

One skill developed with the above process is “Suspended judgement”, the ability to acknowledge without judging. It’s something I’m looking forward to getting better at over time.

When I have free time, I go through my notes. One by one, I think about each item deeply, transfer it to a digital form, and cross it out. This is a good opportunity to spend time researching, thinking, and elaborating on each noted item.

Crossed items

The transferred items are crossed out in red ink.

Transferring notes from their physical form to a digital form is an extremely beneficial step. It forces me to review the notes; this yields one more occurrence of Spaced Repetition, which helps me retain important info. The solitude while transferring notes also gives me time to think more deeply about what I noted.

My note taking setup is rather simple. I use a mini Rocketbook with a 4-color Frixion pen. With this combination, I can reuse the notebook forever. I note in 3 styles of bullet points: · (dot) for tasks, ? (question mark) for questions, and - (dash) for notable info. On the digital side, I store notes in plaintext and sync them to Github using my homegrown app, Git Notes.

Taking notes is a very personal process. What works for me might not work for you. You’ll need to iterate at your own pace to figure out what you like and don’t like. My general advice is to approach it casually. Don’t stretch yourself too much. This is similar to practicing meditation, yoga, or anything alike: we want to deepen our practices, but we only move forward when we are comfortable.

Once we have this noting-thinking loop built into our daily routine, enriching our note taking practice becomes much easier. We can become more considerate by noting other people’s states of mind and later thinking about how they impact what we do. We can become more aware of our mistakes by noting them and later thinking about how to avoid them next time. The possibilities are endless.

It has been a few years already since I started taking notes. I feel more grounded. I feel more thoughtful. I feel more confident retaining the info flowing through me. So, I encourage you to start taking notes in the workplace, and I hope my process serves as one example that you can take and personalize to suit your style.

Happy noting!

The missing pieces of English pronunciation lessons

I’ve been thinking about my English pronunciation issues since I moved to the Bay Area (from Thailand) in 2012. I can’t seem to speak English properly.

There were many moments that made me feel bad about myself.

I can’t pronounce the word guitar. I don’t get it, and people don’t understand me when I say this word. I once said guitar to a lady at the post office. She didn’t get it. I pronounced it to her several times with different stress patterns. She was very patient… but she still didn’t get it. Eventually, I spelled it out for her. She got it and pronounced guitar back to me. I felt like that was exactly how I had pronounced it.
