How to Piss SQL Off, or The Downside of Doing Interesting Projects

It was a normal morning in red zone Cambodia. I heard that police checkpoints were becoming more numerous, though, even stopping food delivery drivers. I would need to think about this soon. Especially as I was short on mixers. Cocktails by the pool weren't as nice when your cocktail was a coffee mug of cheap vodka. I also should have done dishes more often. All of this could wait though. I had tournaments to enter into my database.

A lot of the motivation behind what I pompously consider my solid pace in learning data science skills has come from my disc golf database. One of the first pieces of advice I give to folks trying to learn data science, and the one I repeat most often, is to find projects that interest you. If the project interests you, you will be more motivated to learn the skills needed to succeed at it. When a bug seems to hold up everything, it will be your love for the subject matter that forces you to squash that bug. And I love me my disc golf. Doing interesting projects does have a dark side though. I'll get to that later.

So, entering tournaments into my disc golf database. After a few iterations, I have simplified the process of entering a tournament to a couple of minutes each. My desire to make this process easier has led me to improve my use of classes and functions in Python (see the "Do Interesting Projects" section above). It would be an even faster process if the Professional Disc Golf Association's API actually sent useful data, but it is what it is. So I was going through my normal robotic keystrokes to enter this Finnish tourney when... error. It was a weird time to have an error. I had done all the hard work and all I was doing was saving a dataframe as a table in a database.

Operational error. Something to do with SQL. "Too many SQL variables." Was it a SQLite problem? At first it didn't look like it, as my pandasql queries were having issues too. (pandasql runs its queries on SQLite by default, though, so that wasn't really independent evidence.) It seemed that I had hit a magic number that makes SQL angry. I was trying to pass queries with exactly 1000 variables.
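
For anyone who wants to see where that magic number sits on their own machine, here is a minimal sketch using only Python's built-in sqlite3 module. As far as I know, SQLite's default limit on bound variables was 999 before version 3.32.0 and 32766 from that version on, and some builds are compiled with a different value, so the number you see may differ. The function names and the search ceiling of 300,000 are my own choices for illustration, not part of any library.

```python
import sqlite3


def accepts_n_variables(conn, n):
    """Return True if this SQLite build runs a query with n bound parameters."""
    placeholders = ",".join("?" * n)
    try:
        # An IN list is used so the only limit in play is the variable count.
        conn.execute(f"SELECT 1 WHERE 0 IN ({placeholders})", (1,) * n)
    except sqlite3.OperationalError:
        # Assumption: any OperationalError here means "too many SQL variables".
        return False
    return True


def max_bound_variables(conn, ceiling=300_000):
    """Binary-search the largest accepted parameter count, up to `ceiling`."""
    if accepts_n_variables(conn, ceiling):
        return ceiling  # the real limit is at least the (arbitrary) ceiling
    low, high = 1, ceiling  # invariant: `low` is accepted, `high` is rejected
    while high - low > 1:
        mid = (low + high) // 2
        if accepts_n_variables(conn, mid):
            low = mid
        else:
            high = mid
    return low


conn = sqlite3.connect(":memory:")
limit = max_bound_variables(conn)
print(f"SQLite {sqlite3.sqlite_version}: up to {limit} bound variables per query")
```

If the reported limit is 999, a single-row insert into a 1000-column table is already one variable too many.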

A couple more details about my disc golf database and how I work the data (true up to yesterday). When I play with my data, I work from one large pandas dataframe. As of yesterday, this dataframe had 3759 rows and 1000 columns. After I make changes to the dataframe by adding a tournament or new metric, I save the dataframe directly into my database. The database included the one massive table and a bunch of other tables for each tournament, which had data that duplicated that in the big table. Some might call it the... PERFECT DATABASE. Others might not call it that. In any event... not so normalized.

But it worked fine (up to yesterday). I could enter data easily and get it out again. I could gather disc golf data and do disc golf-related stuff with it. And that's all I needed to do, so it was wonderful...

The downside of doing projects you're interested in....

When you are working on projects that interest you, the desire to get results (the purpose of the project) can overwhelm the desire to follow best practices. If it works, it goes in. Even if we have not followed an ideal process, the fact that we are able to work with this interesting data and that it can answer questions we really want answered is enough. How we work on projects that interest us less is different. When you are learning Python, for example, and doing a project looking at the subject matter of Elon Musk's tweets (boring...), you will follow all the steps that your teacher, or model, followed, because that's the point. The process is the purpose, not the result. In the end, it doesn't matter what some rich clown tweets about. What matters is that you learned the right way to do something new.

I worked all day on making friends again with SQL. I had to change the database structure (give it a structure, that is) and edit many of my functions. Not surprisingly, things go a lot smoother now. I still work with the massive dataframe. But now when I save it, it breaks into smaller tables and SQLs into a much more reasonable, almost normalized I'd say, database. I wish I had done it like this from the start. Problem was that I just loved disc golf too much. Keep doing the projects you love. Just be sure to do them right.
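
The general idea of breaking one very wide table into narrower ones can be sketched as below. This is not the actual code from the disc golf project, just one possible approach under stated assumptions: it uses plain sqlite3 and a list of dicts instead of pandas to stay self-contained, and every name in it (`save_wide_rows`, `chunk_columns`, the `part_N` table names, the `row_id` key, the 200-column chunk size) is hypothetical. Each narrow table repeats the key column so the pieces can be joined back together.

```python
import sqlite3


def chunk_columns(columns, max_per_table):
    """Split a list of column names into consecutive groups."""
    return [columns[i:i + max_per_table]
            for i in range(0, len(columns), max_per_table)]


def save_wide_rows(conn, rows, key, max_per_table=200, prefix="part"):
    """Store wide rows as several narrow tables that all share `key`."""
    columns = [c for c in rows[0] if c != key]
    table_names = []
    for n, cols in enumerate(chunk_columns(columns, max_per_table)):
        table = f"{prefix}_{n}"
        col_defs = ", ".join(f'"{c}"' for c in cols)
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
        conn.execute(f'CREATE TABLE "{table}" ("{key}" PRIMARY KEY, {col_defs})')
        # Each insert binds len(cols) + 1 variables, well under the old 999 cap.
        placeholders = ", ".join("?" for _ in range(len(cols) + 1))
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})',
            [[row[key]] + [row[c] for c in cols] for row in rows],
        )
        table_names.append(table)
    conn.commit()
    return table_names


# Made-up data: 3 rows, 1000 metric columns plus a key.
conn = sqlite3.connect(":memory:")
rows = [{"row_id": r, **{f"metric_{c}": r * c for c in range(1000)}}
        for r in range(3)]
tables = save_wide_rows(conn, rows, key="row_id")
print(tables)
```

Splitting by column count alone only dodges the variable limit. Grouping columns by what they describe (players, tournaments, rounds) is what would move the result toward an actually normalized design.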

It's time to get mixers.