I love carving out R&R time. It is time for “reflection and re-alignment” (in addition to rest and relaxation), and it always leaves me feeling refreshed and re-energized.
I try to reflect and re-align at the end of every year. But instead of doing the traditional “New Year’s Resolutions” (we all know how well those work), I’ve improved the process for better results.
I go through a process where reflection, not just resolution, is the core. Because to create our future, we must learn from the past. I also ask a few questions that serve the following purposes:
Create pride and gratitude, emotions shown to be associated with more perseverance.
Focus on what matters most, e.g. using the 80/20 principle.
Promote outside-the-box thinking to break out of normal thought patterns.
Major thanks to Tim Ferriss’s tips and all the research on how we achieve for inspiring my version of “New Year’s Resolutions”–or should I say “New Year’s Reflections”.
So what is my process?
I run through the reflection and goal-setting questions in a Google Doc template I created. You can copy and modify it for your own use.
Tips for R&R time
I do it at least a few times a year, and want to do it once a quarter, so that I don’t veer too far from the path I want to be on.
It’s best to try to leave your normal environment for at least a few days. Changing your environment changes your thought patterns.
Do nothing but R&R while away (or at least carve out a few days to do nothing but R&R on a longer trip). Save the minute-granularity itinerary planning, rushing from one destination to another, and adrenaline (or cortisol) producing activities for your other “vacations”.
What does one actually do to R&R? Here’s what I do: pen and paper journaling (stream of consciousness, about the past, about my dreams, anything), meditate, read, walk a lot in nature, eat, spend time by myself but also with loved ones.
Do you have any New Year’s rituals that help you start the year off right? Reach out and share them!
As 2018 comes to an end, I wanted to reflect and write down some of the things that have impacted me this year, and into the future. I made these thoughts brief, as I want to be concise and prioritize what had the most impact. Hopefully readers find my thoughts useful in a practical or thought provoking way. I’m happy to talk more about any of these topics, just reach out or comment!
The following thoughts are roughly categorized, and not in any particular order. Disclaimer: this page does not contain medical advice; every individual’s body and mind are different.
Stretching (before and after every workout) and continual rehab/strengthening (after every workout), along with avoiding certain exercises that naturally aggravate old injuries, have completely eliminated the re-emergence of weight training related injuries *knocks on wood*. Stretching and softening muscles is one of Tom Brady’s secrets to his longevity. And LeBron’s too: “play hard, have fun, and stretch”.
Some of my favorite stretches and rehab/warmup exercises this year:
Medicine ball rolling (as an alternative to foam rolling), self explanatory
Zinc supplements have staved off oncoming colds several times for me this year. This is my go-to immune health supplement, which I “superdose” (i.e. 3-5 tablets a day) when I feel a cold coming.
A cup of coffee (caffeine) works wonders for me. I never realized how much more energy and alertness it gave me before this year, when I started drinking it more often because it’s free and tasty at work. I can only have one cup though, and earlier in the morning, or else I stay up all night. I’ve been using it together with L-theanine. I like to save this combo for special situations (also so that I don’t develop caffeine dependence and withdrawal).
Floating has helped me relax and stay centered. It’s also given me some thought provoking experiences. I like Lift in Brooklyn. Sign up for their mailing list, they have deals/coupons a few times every year.
I’ve really enjoyed Sam Harris’s Waking Up App. His meditations and lessons are educational and thought provoking, in addition to being very relaxing of course.
Speaking of Sam, I found his recent podcast with the TV mentalist and hypnotist Derren Brown fascinating; hypnosis can be powerful. I’m exploring self-hypnosis, as well as acupuncture, after hearing of a friend of a friend having allergies “cured” from it. I expect the placebo effect—namely the power of expectation and belief—to play a huge role in why these things “work”. Even if that’s true though, it means these practices can still be beneficial.
The mind and body are so connected that all of this might as well be under Health.
I continue to love building digital products that people use. Some of the things I created in 2018:
[in progress] ShiftReader: a better speed reading training tool than what Spreed was. The link is just a landing page with fake pricing (I’m doing price testing), so click “Sign Up” and enter your email if you’re interested in email updates.
[sorta dead] CryptoMint: was previously a paid subscription newsletter for crypto news with automated sentiment analysis on scraped articles, which actually had a good number of subscribers. After deciding I did not want to be in the business of selling “predictions”, esp. in a market like crypto, I turned it into a free crypto newsletter (where the articles are still being scraped) that I only sometimes send out. I have about 430 people on the mailing list.
[dead] CryptoSaver: a web app that automated dollar cost averaging into crypto. I killed it after realizing that users were still terrified of some web app placing crypto buy orders automatically through Coinbase, even though it was via oauth, each buy order had to be manually approved, and the app wouldn’t have any permissions to do anything else on the account like sell or transfer. I didn’t invest much before talking to users about this idea (and I try not to with most of my ideas): I only put up a legit looking landing page and did some light Python work to understand how the Coinbase API worked.
I’m really happy to have found “solo entrepreneurship” communities this year, like the Indie Hackers community and Microconf, and specific people in that community I can talk to, like Christian
I’ve been working at Squarespace as a Data Scientist for a little over a year and a half now, working closely to support Product. The thoughts below are primarily about that kind of Data Science, vs. machine learning engineering type roles, or Data Scientists that support other stakeholders like Marketing or Sales. I’ve gotten a good chance to learn and think about:
How Data Scientists and PMs should work together: more of a partnership and less of a conduit for data access. Like any good relationship, it takes time and effort to develop that partnership.
Event data standardization, event tracking “grammars” that are intuitive and self documenting, and the importance of data governance in a truly data-driven organization. And by data-driven orgs I mean orgs that use data (and Data Science) in a meaningful way to drive product-level and even company strategy-level decisions, not an org that only looks at if metrics are going up. 📈 Like all things in life, a balance of both is necessary.
The power of quantitative + qualitative research in understanding users i.e. what Data Scientists (can) do + what User Researchers do. Data shows what users do. User interviews get at why users do what they do, or what they couldn’t do (which you can’t observe with data). Together, they are the voice of the user.
I’m very bullish on Segment, and the massive and growing value they provide for Product orgs that want to be data driven (which is also a growing number). For example, I love what they’ve created with Protocols and Typewriter. Now that they’re the centralized data hub for companies, they can build powerful analytical products like Personas too.
As always, you can follow along with what I’m reading on Goodreads.
A few of the most impactful ones I read this year:
Understanding your users is the best way to continue building a product that they want and ultimately cannot live without. One way to better understand your users and how they experience your product is to talk to them or survey them; another way is to dig into data on how they’re using your product–what actions are they taking, how much do they come back to use your product–to gain insight into how you might be able to improve their experience. This is often called “product analytics”.
While preparing my first iOS app for release, I thought about how I might track user behavior in my app so that I’d have the data needed to explore how they’re using it, such as what buttons they’re pressing, what screens they’re visiting, etc. In web development projects, I’ve traditionally relied on tools like Mixpanel to track events easily and explore and visualize user behavior in different ways, but Mixpanel has been too limiting and expensive, so I decided to go with a cheaper and more flexible solution (but less user friendly on the visualization side of things) for mobile app event tracking and analytics, Google Analytics for Firebase.
We all make mistakes when using any new tool, but I came across some nuances of Google Analytics for Firebase (Firebase Analytics for short) that I wish I knew about before I started using it. Here is a list and short description of each, which will hopefully help new users of Firebase Analytics learn from the mistakes I made.
List of Firebase Analytics Nuances
“Turn on” parameter reporting from the start if you have dimensions in your events that you want to see numbers for, at a glance, in Firebase.
Link Firebase to BigQuery from the start if you want access to your raw event data.
Firebase’s default Funnel reports are “open” funnels, not “closed” funnels.
“Turn on” parameter reporting from the start if you have dimensions in your events that you want to see numbers for, at a glance, in Firebase Analytics.
Firebase Analytics gives you some basic visualizations out of the box, like how many times a certain event fires, over time. I had an event that would fire whenever an upgrade popup was shown to a user, and I specified a parameter called “source” which would note which action preceded the upgrade screen, so I could see the most common paid features that free-tier users tried to access. However, Firebase Analytics did not report on this “source” dimension at all until I manually set up “parameter reporting” for it. So don’t forget to enable “parameter reporting” for important event parameters/dimensions that you care about!
In the Event view, click the three vertical dots to the far right of your event, then add a parameter of your event to the table by clicking and dragging
Firebase Analytics will start collecting numbers for that parameter (here, “source”), which you’ll be able to see in the report for the parent event (here, “upgrade_popup_show”)
Link Firebase to BigQuery from the start if you want access to your raw event data.
By default, your raw event data is collected and made available to you only after you link Firebase to BigQuery. When I first implemented Firebase, launched my app, and got a handful of users, I could see a high level picture of their behavior via Firebase Analytics’ basic visualizations. A few weeks later, I found out that I had to link Firebase to BigQuery explicitly to start telling Firebase to “save” my raw event data, and only after doing so did I see that raw data coming in (and saved into tables in BigQuery). So I had “lost” the first several weeks of raw event data, which isn’t bad for my small app, but could be more costly for a high profile, heavily marketed app launch where mobile analytics and being able to mine insights from the data matters more.
Note that when you link Firebase to BigQuery, you’ll need to upgrade to Google Cloud Platform’s Blaze plan, which is pay-as-you-go: you pay only for the bandwidth, storage, etc. that you use. You can visit their calculator to estimate your costs, but so far, collecting the data and running infrequent BigQuery SQL queries for my app has been free.
Firebase’s default Funnel reports are “open” funnels, not “closed” funnels.
If you go into Firebase Analytics’ Funnels page, you’ll see an area where you can create a funnel easily. After trying to do so, I found out that the funnels Firebase creates are “open” funnels, meaning that at each step of the funnel, a user doesn’t have to have completed the previous step of the funnel to be included in the count of that step. In my opinion, “closed” funnels, where a user counted at a given step has to have completed the preceding step, are much more informative; they’re also a core feature of other event analytics tools like Mixpanel and Heap. Several others are also confused about Google’s decision to have Firebase only report open funnels.
For example, I created a funnel in Firebase Analytics to report on what percentage of users who open my app for the first time go on to take their 1st photo with my app, then what percentage of those go on to take their 2nd photo, etc. I expected fewer and fewer users to make it to each step of the funnel, so I was surprised when I saw what appeared to be 100% of users who take one photo take two, 100% of users who take two photos take three, etc., until I found out that Firebase had constructed an open funnel.
There isn’t a setting in Firebase Analytics to see closed funnels yet, so I decided to create a closed funnel in BigQuery with SQL, on my raw event data.
I won’t go into the details here, but I tested a few different kinds of SQL queries for constructing closed funnels, and the following “LEFT JOIN”-based one had much better performance than a “subqueries”-based one that you may find elsewhere on the internet. You too can create closed funnels to better understand the flow of your users, if your event data is in BigQuery. Here’s my SQL query for the closed funnel “first open -> take 1st photo -> take 2nd photo -> take 3rd photo” (it uses UNNEST to flatten the event_dim array, because each row in these tables holds an array of events):
SELECT
  count(distinct e0.user_dim.app_info.app_instance_id) as first_openers
  , count(distinct e1_user) as photo_taken_1
  , count(distinct e2_user) as photo_taken_2
  , count(distinct e3_user) as photo_taken_3
FROM `youday_IOS.app_events_*` as e0, UNNEST (e0.event_dim) as e0_events
LEFT JOIN (
  SELECT
    events.name as e1_eventname
    , e.user_dim.app_info.app_instance_id as e1_user
    , events.timestamp_micros as e1_ts
  FROM `youday_IOS.app_events_*` as e, UNNEST (e.event_dim) as events
) ON e0.user_dim.app_info.app_instance_id = e1_user
  AND e1_eventname = 'add_photo_from_camera'
LEFT JOIN (
  SELECT
    events.name as e2_eventname
    , e.user_dim.app_info.app_instance_id as e2_user
    , events.timestamp_micros as e2_ts
  FROM `youday_IOS.app_events_*` as e, UNNEST (e.event_dim) as events
) ON e1_user = e2_user
  AND e2_eventname = 'add_photo_from_camera'
  AND e2_ts > e1_ts -- 2nd photo taken after 1st
LEFT JOIN (
  SELECT
    events.name as e3_eventname
    , e.user_dim.app_info.app_instance_id as e3_user
    , events.timestamp_micros as e3_ts
  FROM `youday_IOS.app_events_*` as e, UNNEST (e.event_dim) as events
) ON e2_user = e3_user
  AND e3_eventname = 'add_photo_from_camera'
  AND e3_ts > e2_ts -- 3rd photo taken after 2nd
WHERE e0_events.name = 'first_open'
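To make the open-versus-closed distinction concrete, here is a minimal Python sketch on a handful of made-up events. Only the event names `first_open` and `add_photo_from_camera` come from the query; the user IDs, timestamps, and the helper names `open_funnel` and `closed_funnel` are assumptions for illustration, and this is not how Firebase or BigQuery compute anything internally.

```python
from collections import defaultdict

# Toy event log: (user_id, event_name, timestamp). All values are made up.
EVENTS = [
    ("u1", "first_open", 1),
    ("u1", "add_photo_from_camera", 2),
    ("u1", "add_photo_from_camera", 5),
    ("u2", "first_open", 1),
    ("u2", "add_photo_from_camera", 3),
    ("u3", "first_open", 2),
    # u4 has photo events but no first_open event in the data.
    ("u4", "add_photo_from_camera", 4),
    ("u4", "add_photo_from_camera", 6),
    ("u4", "add_photo_from_camera", 7),
]

STEPS = ["first_open"] + ["add_photo_from_camera"] * 3


def open_funnel(events, step_events):
    """Count users per step, ignoring whether earlier steps happened."""
    return [
        len({user for user, name, _ in events if name == step})
        for step in step_events
    ]


def closed_funnel(events, photo_steps=3):
    """Count users per step, requiring every earlier step to be completed."""
    openers = {user for user, name, _ in events if name == "first_open"}
    photo_times = defaultdict(set)
    for user, name, ts in events:
        if name == "add_photo_from_camera":
            photo_times[user].add(ts)
    counts = [len(openers)]
    for step in range(1, photo_steps + 1):
        # Reaching photo step N needs a first_open plus at least N photo
        # events with strictly increasing (i.e. distinct) timestamps.
        reached = {u for u in openers if len(photo_times.get(u, ())) >= step}
        counts.append(len(reached))
    return counts


open_counts = open_funnel(EVENTS, STEPS)
closed_counts = closed_funnel(EVENTS)
print("open:  ", open_counts)
print("closed:", closed_counts)
```

In this toy data the open funnel reports the same count at every step, which reads as 100% conversion, while the closed funnel shrinks step by step.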
Firebase for Mobile Product Analytics
Firebase makes it easy to track events and collect all of them into a datastore, so you have the data you need to (quantitatively) understand how users are using your mobile app. There are just a few “manual switches” that someone using Firebase Analytics should know about, to ensure that they’re collecting complete behavioral data from the start. Firebase can also improve its visualizations to be more informative and insightful, so users don’t have to write SQL as much. Firebase certainly has the potential to get there, with its relatively affordable “utility” or “pay-as-you-go” pricing model and its superior data storage and querying capabilities (good luck trying to get your raw data out of the other event analytics platforms). I enjoy learning from my users to build a better product, and having the data to do so, and am excited to see what Firebase Analytics can do over time for the advancement of product analytics.
I’ve been reading Data Science for Business, by Provost and Fawcett, a very useful book that explains some of the most important principles and topics in data science. The authors’ language and structure help a lot in developing an intuitive understanding of key data science concepts like model tuning, model evaluation, and various models themselves like decision trees, linear models, and k nearest neighbors. I highly recommend the book if you’re someone who works with data scientists, if you’re a beginner data scientist, or even if you’re a data science expert who’s looking for a good resource to refresh your fundamentals with.
I found this one chapter particularly interesting because it talks about a framework, or way of thinking, that I haven’t really heard about elsewhere. While specific tactics, such as how different kinds of models work, are definitely important and a large part of what a Data Scientist needs to know and be able to do, I think higher level strategy is also important. Anyways, the framework is highly practical, which fits the authors’ theme for the book: that data science isn’t just about analyzing data, but also about understanding the business problem in an analytical way. I wished there was something tangible and interactive to go along with their explanations in this chapter (and others), so I decided to create a guide of sorts, this blog post plus an interactive Jupyter Notebook you can download and play with. The blog post provides context if you haven’t read the corresponding chapter in the book yet, so the Jupyter Notebook is near the end.
If you have the book already, this blog post corresponds to the latter “half” of Chapter 7, “Decision Analytic Thinking I: What Makes a Good Model?”. This guide, and especially the Jupyter Notebook, assumes that the reader already has some familiarity with the basic ideas of machine learning, such as supervised learning (specifically classification), data pre-processing, holdout set testing, and model evaluation.
When applying data science to solve business problems: what is the real goal?
As with any sort of problem, you have to uncover what the real goal of a data analytic project is. It can be tempting to get caught up in the surface level question or jump straight into solutions.
For example, questions about customers come up a lot in business: which customers are most likely to churn? Which customers are most receptive to upselling? The idea is that once we can predict which customers are most likely to be upsold, we can call them, try to get them to buy more items like an add-on for the thingamajig they just bought, and generate more revenue for the business. Let’s run with this “upselling” case as an example.
The real business goal behind answering “which customers are most receptive to upselling?” is not only to generate more revenue from upselling customers, but also to maximize the profit generated from our efforts. Not all customers will be equally likely to be upsold (some are curmudgeons, others might have a real need for the other products we’re selling), those who we do upsell could purchase different amounts of stuff, and the act of upselling costs us time and money (which can also be variable). So how do we even structure a problem like this, and then decide what to do?
Introduction to the expected value framework, and how it helps break down problems
Let’s introduce the expected value framework, and weave it into how we’d structure and break down our business objective for this “upselling” project.
As a quick refresher:
expected value (of a variable) – a predicted value of a variable, calculated as the sum of all possible values, each multiplied by the probability of its occurrence
Basically, what do we anticipate, or expect, the value of some variable to be, given that there is some uncertainty in the chances of different outcomes happening?
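As a quick illustration of the definition, here is a minimal Python sketch. The function name `expected_value` and the fair-die example are assumptions chosen for illustration; they are not from the book.

```python
def expected_value(outcomes):
    """Sum of probability * value over (probability, value) pairs."""
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * value for p, value in outcomes)


# A fair six-sided die: each face shows up with probability 1/6.
die = [(1 / 6, face) for face in range(1, 7)]
die_ev = expected_value(die)
print(round(die_ev, 2))
```

No single roll ever lands on the expected value here; it is what the rolls average out to over many repetitions.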
Frame the question in terms of expected value
Back to our upselling question. Each customer has his/her own probability of being upsold, and likely amount that they will be upsold for; there’s also a cost to upselling, which we may have to eat if we call a customer who doesn’t want to buy anything else from us. So, thinking in terms of expected value, each customer will have an expected profit, given that we reach out to that customer to try and upsell them. More specifically:
$$E[\text{profit} \mid x] = p(u \mid x) \cdot \text{profit}(u) + \big(1 - p(u \mid x)\big) \cdot \text{profit}(\neg u)$$
Which means that, assuming we reach out to a customer $x$, the expected value of profit ($E[\text{profit} \mid x]$) equals the probability of upselling the customer ($p(u \mid x)$) times the profit we’d get from upselling the customer, plus the probability of failing to upsell the customer (1 minus the probability of upselling the customer) times the profit we’d get from failing to upsell the customer.
Breaking out profit in each potential outcome:
$$E[\text{profit} \mid x] = p(u \mid x) \cdot (v_u - c) + \big(1 - p(u \mid x)\big) \cdot (0 - c)$$
Where $v_u$ is the value, or revenue generated, from upselling the customer, and $c$ is the cost of trying to upsell the customer (we assume the cost is constant across customers for simplicity). Notice in the second half of the equation that if we fail to upsell the customer, the outcome is that we get $0 in revenue and eat the cost ($c$) of trying.
Now, the path to obtaining our original business goal, to maximize total profits, is clear: try to upsell all customers where the expected profit of trying to upsell each one is greater than 0 (assuming we don’t have any budget or constraint on how many customers we can upsell to).
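The "greater than 0" rule can be turned into a threshold on the predicted probability. As a short worked derivation, write $p$ as shorthand for the probability that the upsell succeeds, $v_u$ for the revenue from a successful upsell, and $c$ for the cost of trying (symbols chosen here for illustration, with $v_u > 0$ and $c$ constant across customers):

```latex
\begin{aligned}
E[\text{profit}] &= p\,(v_u - c) + (1 - p)\,(0 - c) \\
                 &= p\,v_u - p\,c - c + p\,c \\
                 &= p\,v_u - c \\
E[\text{profit}] > 0 &\iff p > \frac{c}{v_u}
\end{aligned}
```

In other words, under these assumptions a customer is worth calling whenever their predicted probability of being upsold exceeds the ratio of cost to revenue.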
Expected value breaks the problem down for us
Also, thinking in terms of expected value has now broken up the problem nicely for us: to figure out the expected profit of trying to upsell a customer, figure out (1) the probability that upselling will work, $p(u \mid x)$, (2) the value of a successful upsell, $v_u$, and (3) the cost of trying to upsell a customer, $c$.
Now, we can go more low level and think about how we might address each piece analytically. We can build a machine learning model, a classifier, on historical customer data of which kinds of customers were successfully upsold and which kinds weren’t, to address (1) and generate a predicted $p(u \mid x)$, or probability that upselling will work, for each customer. For simplicity, we’ll assume that both (2) and (3) are constant across all customers, but technically, you could build another model to predict (2), the value of a successful upsell for a given customer.
More specifically, for (1), our historical customer data is a snapshot of all customers that we’ve previously tried to upsell to, at time t. One column in the data is whether or not (e.g. a 1 or -1, or 1 or 0) we were able to successfully upsell each customer by some future date t+1, say 3 months later; this is the target variable. The other columns, or features, contain data on each customer before time t, such as number of previous purchases, number of times customer has been back to our online store, shipping zip code (which we can estimate income level with), etc.
Now we have a structure, thanks to EV (expected value), for evaluating whether we should try to upsell any individual customer in order to maximize company profits.
Let’s plug in some numbers to see how we might use our structure to make decisions on whether we should try to upsell a customer or not.
Take Customer A. Based off of what we know about other customers that are similar to him, our machine learning model predicts that he has a 91% chance of being upsold, if we call him.
Let’s assume that if we upsell a customer, they will spend $100 to buy an add-on to the thingamajig they already bought. Let’s also assume that on average, it takes a 30 minute phone call at a salesperson’s hourly wage of $30 / hour, to try to upsell someone, so the cost of upselling is $15.
Therefore, the expected profit for trying to upsell Customer A will be:
$$0.91 \cdot (\$100 - \$15) + (1 - 0.91) \cdot (\$0 - \$15) = \$77.35 - \$1.35 = \$76$$
And since the expected profit is positive, it is worth it to try and upsell him, because on average (if we keep trying to upsell people like him), we will generate $76 in profits each time for the company.
Now let’s look at Customer B. Based off of what we know about other customers that are similar to her, our machine learning model predicts that she has a 4% chance of being upsold, if we call her.
So, the expected profit for trying to upsell Customer B will be:
$$0.04 \cdot (\$100 - \$15) + (1 - 0.04) \cdot (\$0 - \$15) = \$3.40 - \$14.40 = -\$11$$
We should not try to upsell customers like Customer B, because on average, we will lose $11 each time.
If we do this expected value calculation for each customer we’re thinking about upselling to, we can arrive at a subset of customers where the expected profit of upselling each one is positive, and thus if we try to upsell all of them, our expected total profit will be maximized.
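The selection step can be sketched in a few lines of Python. This is a minimal illustration, not the code from the notebook: the $100 revenue, $15 cost, and the probabilities for Customers A and B come from the example above, while Customer C and the names `expected_profit`, `predicted`, and `targets` are assumptions made up for the sketch.

```python
UPSELL_VALUE = 100.0  # revenue from a successful upsell (from the example)
UPSELL_COST = 15.0    # 30-minute call at $30/hour (from the example)


def expected_profit(p_upsell, value=UPSELL_VALUE, cost=UPSELL_COST):
    """Expected profit of trying to upsell one customer."""
    profit_if_upsold = value - cost
    profit_if_not_upsold = 0.0 - cost
    return p_upsell * profit_if_upsold + (1 - p_upsell) * profit_if_not_upsold


# Predicted upsell probabilities. A and B are from the text; C is made up.
predicted = {"A": 0.91, "B": 0.04, "C": 0.20}

profits = {name: expected_profit(p) for name, p in predicted.items()}
# Keep only the customers whose expected profit is positive.
targets = [name for name, profit in profits.items() if profit > 0]

for name, profit in profits.items():
    print(name, round(profit, 2))
print("call:", targets)
```

In a real project the `predicted` dictionary would be filled with the classifier's probability output for each customer rather than hand-typed numbers.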
See this Jupyter Notebook for a full example of training a machine learning model on historical customer data to predict whether or not a customer will be upsold, and the associated probabilities of each outcome happening. These probabilities, along with the expected value framework, are then used to show which customers we should try to upsell to maximize our company’s profit.
Note that using the expected value framework to calculate something like expected profit depends entirely on two things: the probabilities of different outcomes (e.g. a customer successfully being upsold or not) and the benefit or cost of each outcome. Both can be estimated with models and comprehensive data, but not always very well, or it may be impossible in the first place. This is where both business and data understanding come into play: a data scientist has to understand what data is available and what it can be used for, and also understand how the business works so that accurate cost/benefit numbers can be gathered. This also means that the results of using expected value are sensitive to changes in either type of variable, probabilities or cost/benefit numbers. Though the expected value framework can be a practical and structured way to break down a business analytic problem, the data scientist may have to use other methods to inform action if he/she doesn’t have enough confidence in the probability or cost/benefit estimates. Like all things in life, there is no one size fits all approach: the EV framework is a tool in a data scientist’s big toolbox.
Thanks for reading, I’m always open to questions, suggestions, or other kinds of feedback!
We all know how hard making decisions about our own lives can be sometimes, such as decisions about your career, or your relationships.
Here’s a list of several thought experiments I’ve come across over the years that have personally given me more perspective, making hard decision making a little bit easier sometimes. Though they’re all slightly different, they seem to operate similarly, cutting out fear and external influences to drill into what our deepest personal values are.
Ruth Chang’s idea that every hard choice is an opportunity to “become the authors of our own lives”. Watch her full TED Talk (15 minutes), it’s amazing.
I’m not sure if any of these will always give the “right” answer, and I also think that these thought experiments are just part of the puzzle to improve decision making about one’s own life. As Kahneman, Mauboussin, and Munger suggest, we should use a rational decision making framework or even a checklist* because humans are very prone to cognitive biases and shortcuts that can lead to bad decisions. Even as just a piece of the puzzle, these thought experiments have allowed me to think about decisions from different perspectives, which is always valuable.
Please add any other relevant thought experiments, and/or thoughts about decision making!
*I personally use a checklist similar to WRAP, which is simple to remember and covers a majority of the most common cognitive traps we can fall into. The Heath brothers describe WRAP more in Decisive. Using their terminology, the above thought experiments could belong to the “A” step of WRAP, or “attaining distance/perspective”.
One of the side projects I worked on in the past handful of months was Mr. Market Feels: a stock market sentiment Twitter bot that used automated image processing to extract and tweet the value of CNN Money’s Fear and Greed Index every day.
There have been attempts to backtest the predictive power of the Fear and Greed Index when buying and selling the overall stock market index depending on the value (the results suggest there isn’t much edge for that particular strategy). Anecdotally though, I’ve found the CNN Fear and Greed Index (what I’ll call FGI for short) to be a pretty good indicator of when this bull market has bottomed out during a short-term retracement, and when I had more time, I used it to trade options with decent success. Going to CNN’s website every day to check the FGI was a pain, and I also wanted the numerical values in case I wanted to run some analyses in the future, so I wondered if I could automatically extract the daily Fear and Greed Index values.
I saw this as a fun and short coding project that would help me and others while giving me practice with image processing, so I dove in.
The goal was to extract the FGI “value” and “label” from CNN’s site every day. The value of the Index is 95 and the label is “Extreme Greed” in the screenshot of the FGI below:
Extracting the FGI value and label isn’t as easy as using OCR (optical character recognition) on the image and getting the results: for one, there is a lot of extraneous text in the image. Two: the pixel location of the value and label that we want changes as the FGI changes. Three: the relative position of the value and label also changes as the FGI changes. You can see points two and three in the image below: now, the FGI label (“Extreme Fear”) is to the top left of the FGI value (1). In the original image, the FGI label (“Neutral”) is directly right of the FGI value (53).
Why does all of this matter? Because for clean OCR, images need to be standardized. Or at least they do for Tesseract, the open source OCR engine created by Google. In Tesseract’s case, images of text shouldn’t contain any other artifacts (that the engine might try to interpret as text), should be scaled large enough, have as much image contrast as possible (e.g. black text on white), and be either horizontally or vertically aligned.
Most of the pre-processing of the FGI images to standardize them for Tesseract was straightforward enough. Without going into way too much detail, I used the Python Pillow library to automatically convert the image to black and white, apply image masks to eliminate extraneous parts of the image–like the “speed dial” and the “historical FGI table” on the right hand side–and crop the image down to leave only the FGI value and label, like this:
Here’s where challenge number three came up: the FGI value and label aren’t always either horizontally or vertically aligned, and this reduced Tesseract’s accuracy. For example, in the first image, the FGI label is diagonal from the FGI value. Running Tesseract OCR on it returns “NOW:[newline]Extreme[newline]Fear”, which completely misses the value “10” because of the diagonal alignment. You can try out Tesseract OCR with the above images, or with your own, here.
An Interdisciplinary Solution of Sorts
One solution to the challenge above was to split the resulting image into two images, one with the FGI value and a separate one with the label, so that Tesseract could be run on both and know that both images were either horizontally or vertically aligned. Basically, from a single FGI image, I wanted two images that looked like these:
In thinking about ways to implement that, I first thought about the principles of unsupervised clustering, from the field of machine learning. With clustering, the intermediate, processed FGI image could be segmented and split appropriately by finding the cluster of pixels that corresponded to the FGI value (“10”), and the other cluster of pixels that corresponded to the FGI label (“Now: Extreme Fear”).
Turns out that using the k-means clustering algorithm for image segmentation is pretty common practice.
First, a copy of the image was “pixelated” to ensure that the k-means algorithm would converge on the two correct clusters:
Then, the code applied k-means to find the centroids of the two clusters (green dots). It then derived the line connecting the two centroids (green line), and calculated the bisecting perpendicular line (red line), which can be seen as a “partition” between the two clusters of black pixels.
From there, the original black and white FGI image could be split along the partition line, which would result in the desired two images: one for the FGI value and one for the FGI label. From here, Tesseract would have these two standardized images as inputs and would be able to cleanly extract the FGI value and label.
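The centroid-and-partition idea above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: it assumes the black pixels are already available as (x, y) coordinates, skips the “pixelation” step, and seeds k-means with the smallest and largest points, which is an arbitrary choice made to keep the example deterministic.

```python
import math

def centroid(points):
    """Mean position of a list of (x, y) points."""
    return (sum(x for x, _ in points) / len(points),
            sum(y for _, y in points) / len(points))

def kmeans_two(points, max_iter=50):
    """Plain k-means with k=2; returns the two centroids."""
    centroids = [min(points), max(points)]  # assumed deterministic seeding
    for _ in range(max_iter):
        clusters = ([], [])
        for p in points:
            nearest = 0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
            clusters[nearest].append(p)
        updated = [centroid(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def split_by_bisector(points, c0, c1):
    """Split points by the perpendicular bisector of the segment c0-c1."""
    mid_x, mid_y = (c0[0] + c1[0]) / 2, (c0[1] + c1[1]) / 2
    dir_x, dir_y = c1[0] - c0[0], c1[1] - c0[1]
    side_c0, side_c1 = [], []
    for x, y in points:
        # The sign of this dot product says which side of the bisector the point is on.
        side = (x - mid_x) * dir_x + (y - mid_y) * dir_y
        (side_c0 if side < 0 else side_c1).append((x, y))
    return side_c0, side_c1

# Made-up black pixels: a small "value" blob and a diagonal "label" blob.
black_pixels = [(1, 1), (2, 1), (1, 2), (2, 2),
                (8, 6), (9, 6), (8, 7), (9, 7), (10, 7)]
c0, c1 = kmeans_two(black_pixels)
value_pixels, label_pixels = split_by_bisector(black_pixels, c0, c1)
```

Each group of pixels could then be cropped into its own image and passed to Tesseract separately.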
Lastly, I put the script onto a web server, told a cron job to run it daily, and hooked it up to Twitter’s API to automatically post to the Twitter account Mr. Market Feels. I named it after Ben Graham’s moody Mr. Market.
I just finished reading Poor Charlie’s Almanack (an amazing book full of wisdom and life principles) so Charlie Munger’s multidisciplinary approach to life is on my mind. Though this project was probably a little less multidisciplinary than he means because machine learning and image processing are closely related fields, I still saw it as an example of how broad and varied knowledge and skills can come together to solve a problem effectively. To quote Munger on specialized knowledge: “To the man with only a hammer, every problem looks like a nail.”
Thanks for reading!
UPDATE 6/9/2018: Mr. Market Feels has been broken for a handful of months because various financial data APIs that I’ve tried using have been deprecated. I recently found out about IEX’s free and publicly available financial data API, which Mr. Market Feels is now using and will hopefully make its first tweet post-fix on Monday. I would also highly recommend reading Flash Boys: Michael Lewis tells such an intriguing story about the arms race going on in high frequency trading and the birth of IEX.
In my downtime, I’ve been using Kaggle to get better at applying machine learning to solve problems. The process is not only teaching me new technical skills, but also reminding me of some useful principles that can be applied elsewhere. To keep things digestible, this is the second post of two (the first one is here).
A short list of important skills for a data scientist
When trying to get better at a skill, I try to tackle the highest leverage points–here’s what I’ve been able to gather about three skills that are important in being a data scientist*, from talking with others and reading about machine learning, and experiencing it firsthand with the client projects I do.
Feature engineering
Communication (includes visualization)
Ensembling
The first two are relatively self-explanatory; ensembling brings some pretty interesting concepts that apply to decision-making, in my opinion.
*I’ll be referring to the “applier of machine learning” aspect of “data science”.
Feature engineering is the process of cleaning, transforming, combining, disaggregating, etc. your data to improve your machine learning model’s predictive performance. Essentially, you’re using existing data to come up with new representations of the data in the hopes of providing more signal to the model–feature selection is removing less useful features, thus feeding the model less noise, which is also good. The practitioner’s own domain knowledge and experience is used a lot here to engineer features in a way that will improve the model’s performance instead of hurt it.
There are a few tactics that can be generally applied to engineer better features, such as normalizing the data to help certain kinds of machine learning models perform better. But usually, the largest “lift” in performance comes from engineering features in a way that’s specific to the domain or even problem.
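As one example of a generally applicable tactic, here is a minimal sketch of standardizing a numeric column (subtract the mean, divide by the standard deviation). The income figures are made up, and which models actually benefit from this depends on the algorithm.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no signal
    return [(v - mu) / sigma for v in values]

incomes = [30000, 50000, 70000]  # hypothetical annual incomes
scaled_incomes = standardize(incomes)
```

After scaling, features measured in very different units (dollars vs. years, say) end up on comparable ranges.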
An example is using someone’s financial data to predict likelihood of default, on a loan for example. You might have the person’s annual income and monthly debt payments (e.g. for auto loans, mortgages, credit cards, the new loan they’re applying for), but those somewhat closer to the lending industry will tell you that a “debt to income ratio” is a better metric for predicting default, because it essentially measures how capable the person is of paying off his/her debt, all in one number. After calculating it, a data scientist would add this feature to the training data, and would find that their machine learning model performs better at predicting default.
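A minimal sketch of engineering that feature, assuming the ratio is defined as total monthly debt payments divided by monthly income (lenders may define it slightly differently). The applicant records and field names are made up.

```python
def debt_to_income(annual_income, monthly_debt_payments):
    """Total monthly debt payments divided by monthly income."""
    monthly_income = annual_income / 12
    if monthly_income <= 0:
        return None  # the ratio is undefined without income
    return sum(monthly_debt_payments) / monthly_income

# Hypothetical applicants with hypothetical field names.
applicants = [
    {"annual_income": 60000, "monthly_debts": [400, 1200, 150]},
    {"annual_income": 120000, "monthly_debts": [300, 1500]},
]
for applicant in applicants:
    # Add the engineered feature as a new column next to the raw data.
    applicant["dti"] = debt_to_income(applicant["annual_income"],
                                      applicant["monthly_debts"])
```

The raw columns stay in place; the new column simply gives the model a more direct signal to work with.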
As such, feature engineering (and in fact, most of machine learning) is sort of an art vs. a science, where a creative spark for an innovative way to engineer a domain specific feature is more effective than hard and fast rules. They say feature engineering can’t be taught from books, only experience, which is why I think Kaggle is in an interesting position because they’re essentially crowdsourcing the best machine learning methodologies for all sorts of problems and domains. There’s a treasure trove of knowledge on there, and if structured a little better, Kaggle could contribute a lot to machine learning education.
What potentially useful features/data could we engineer from timestamp strings? We could generate year, month, day, day of week, etc. numeric data columns–much more readable by a machine learning model.
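A minimal sketch of that idea using only the standard library. The timestamp format and the feature names are assumptions; real data would need whatever format the source actually uses.

```python
from datetime import datetime

def timestamp_features(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Break a timestamp string into numeric columns a model can use."""
    dt = datetime.strptime(timestamp, fmt)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "day_of_week": dt.weekday(),  # Monday is 0, Sunday is 6
        "hour": dt.hour,
    }

features = timestamp_features("2018-06-09 14:30:00")
```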
During a recent chat with one of the core developers of the Python scikit-learn package, I asked what he thought some of the most important skills for a data scientist are. I sort of expected technical skills, but one of the first things that came up was communication, or being able to convey findings and why those findings matter to both internal and external stakeholders, like customers. This one’s self explanatory–what good is data if you can’t act upon it.
In fact, it seems like communicating well for data scientists might be even more important than it is for professions like programmers or designers because there’s a larger gap between result and action. For example, with a design or app, a decision maker can look at it or play around with it to understand it reasonably well and make a decision, whereas a decision maker usually can’t just see a bunch of numbers that were spit out by a machine learning model and know what to do: how are those numbers actionable, why should someone believe those numbers, etc. Visualization is a piece of this, as it’s choosing the right charts, design, etc. to communicate your data’s message most effectively.
In machine learning, an ensemble is a collection of models that can be combined into something that performs better than the individual models.
An example: one way this is done is via the voting method. The different base, or “level 0”, models each make a prediction on, say, whether a person is going to go into default in the next 90 days. Model A predicts “yes”, model B predicts “yes”, and model C predicts “no”. The final decision then becomes the majority vote, here “yes”.
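The voting example above, as a minimal sketch. The model names and predictions are the ones from the example; the function name is made up.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common prediction (ties go to whichever was seen first)."""
    return Counter(predictions).most_common(1)[0][0]

base_predictions = {"model_a": "yes", "model_b": "yes", "model_c": "no"}
decision = majority_vote(list(base_predictions.values()))
```

Here `decision` is "yes", since two of the three models voted that way.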
There are many other ways of ensembling models together. An important and powerful one is called stacking, and it is applying another machine learning model–called a “generalizer”, or “level 1” model–on the predictions of the base models themselves. This is better than the voting method because you’re letting the level 1 machine learning model decide which level 0 models to believe more than others based on the training data you feed into the system, instead of arbitrarily saying “the majority rules”.
A high level flow chart of how stacking works.
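To make stacking a bit more concrete, here is a minimal sketch in which the level 1 model is a tiny logistic regression trained on the level 0 predictions. Everything in it is an assumption for illustration: the training rows are made up (model C happens to always be right, A and B are noisy), the base predictions are assumed to come from held-out data, and a real project would more likely use a library than hand-written gradient descent.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_level1(base_preds, labels, lr=0.5, epochs=2000):
    """Fit logistic regression weights on the base models' predictions."""
    n_models = len(base_preds[0])
    weights, bias = [0.0] * n_models, 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_models, 0.0
        for x, y in zip(base_preds, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict_level1(weights, bias, x):
    """True means the stacked model predicts default."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= 0.5

# Each row: predictions of models A, B, C (1 = default); labels are the true outcomes.
base_preds = [(1, 1, 1), (1, 1, 0), (0, 0, 1), (0, 0, 0),
              (1, 0, 1), (0, 1, 0), (1, 1, 0), (0, 0, 1)]
labels = [1, 0, 1, 0, 1, 0, 0, 1]

weights, bias = train_level1(base_preds, labels)
case = (1, 1, 0)                                 # A says yes, B says yes, C says no
vote_says_default = sum(case) > len(case) / 2    # majority vote
stack_says_default = predict_level1(weights, bias, case)
```

On this toy data the majority vote goes with A and B, while the level 1 model has learned to trust C and sides with it instead.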
Ensembling is a key technique in machine learning to improve predictive performance. Why does it work? We all have an intuitive understanding for why it should work, because it’s a decision making framework we all have probably used, or been a part of, before. Different people know different things, and so may make different decisions given a particular problem. When we combine them in some way–like a majority vote in Congress or at the company we work at–we “diversify” away the potential biases and randomness that comes from just following one decision maker. Then, if you add in some mechanism to learn which decision makers should have their decisions weighed more than others based off of past performance, the system can become even more predictive–what areas could benefit from this improved, performance based decision-making process?*
*Proprietary trading companies, where every trade is a data point and thus generated very frequently, do this more intelligent way of ensembling, in a way, by allocating more money to traders who’ve performed better than others historically. A trader who is maybe slightly profitable but makes uncorrelated trades–for example by trading in another asset class–will still be given a decently sized allocation, because his trades hedge other traders’ trades, thus improving the overall performance of the prop trading company. Analogously, in machine learning, ensembling models that make uncorrelated predictions improves overall predictive performance.
Here are some resources related to the topics described above that were recommended to me and that I found most useful. I hope they’re helpful to you too.
A good overview of the principles of data science and machine learning for non-technical and technical folk alike: Data Science for Business
An important thing for a data scientist to have before any of the stuff above is a good understanding of statistics. Elements of Statistical Learning is a detailed survey of the statistical underpinnings of machine learning.
In my downtime, I’ve been using Kaggle to get better at applying machine learning to solve problems. The process is not only teaching me new technical skills, but also reminding me of some useful principles that can be applied elsewhere. To keep things digestible, this is the first post of two.
Deliberate practice, with Kaggle
Deliberate practice–practice that is repeatable, hard, and has fast feedback (e.g. with a coach)–is needed to master any skill. Kaggle provides a great medium for machine learning deliberate practice: you can still solve the problems from old competitions, read about what the top performers did, and get instant feedback on how well your machine learning model performed vs. other people’s.
Aside from accessible deliberate practice, self-learning this way has another big benefit over some of the in-person data science/machine learning classes I’ve observed: the student has control. I can learn as fast or as slow as I need to. I can learn about what I want: not only about what I find most interesting, but about what the top performers on Kaggle and other experts are doing to be successful.
I attempt to solve a machine learning problem on Kaggle, see how I performed, read about and take notes on what the top performers did, and fill in my knowledge gaps with lots of research on Google, continuously cycling between writing down questions about new terms or concepts that come up and answering them. The self-paced, deliberate nature of this learning avoids what Sal Khan calls “Swiss cheese gaps” in education–though of course, it is up to the learner him/herself to stay disciplined and engaged.
The “cycle” of deliberate practice described. Important things to note: it is closed, which allows for the learning from feedback, and it is fast, which allows for that learning to happen quickly, and to be timely.
Something like Khan Academy provides a great structure for self-paced, deliberate-practice-oriented learning for more “traditional” academic topics. I see opportunity for more things like it, in other educational areas. Also, if anyone has found any helpful tools for self-learning, would love to hear about them. I personally use a lot of Google Docs for note-taking, mind42 for topic hierarchies, pinboard to keep track of my online research, sometimes Quizlet to help me memorize things.
Next: 80/20-ing machine learning
In the next post, I will get slightly more technical and into some of the “highest leverage” machine learning concepts and skills, as well as share some resources (including advice from one of the most helpful machine learning educators and practitioners I’ve had the pleasure to interact with). There should also be at least one principle/mental model for those less interested in the technicals of machine learning. As always, please be critical and feel free to discuss anything and everything, I love learning from other perspectives.
For a few months, on nights and weekends while working at my most recent job, I worked on a project to help make clinical trials more efficient, and even built a prototype (the screenshot above, you can play around with it here)–I gave it the memorable and exciting name “Clinical Research Network”.
Though my project didn’t “succeed” in the traditional sense, I learned a lot about this interesting area of health/biotech, and got to practice several important product development skills. The following are the important parts of my story, but warning, it’s still a long post.
Clinical trials have a hard time recruiting enough patients, which causes a lot of waste.
I received an email from HeroX one day about a competition to see who could come up with the best idea to help clinical trials recruit more patients. Intrigued, I did more research on the problem, and decided to enter the competition: worst case I would spend a little time writing a proposal that didn’t win, but still get to learn more about this fascinating problem.
As discussed in a previous post, roughly 10% of clinical trials terminate unsuccessfully because they’re unable to recruit enough patients for the study. There are roughly a thousand new clinical trials every year, and since a clinical trial costs on average $30M-$40M, a lot of money is spent on clinical trials that don’t end up contributing much to the advancement of science and medicine.*
The HeroX competition’s more quantifiable goal was to come up with ideas that could double the patient recruitment rate from 3% to 6%, patient recruitment rate being defined as number of patients who participate in clinical trials / total number of patients out there. The more patients participate in clinical trials, the faster medical research accelerates.
*The numbers used to “size up” the problem are very rough, and taken from various sources. My model also did not account for the fact that a lot of clinical trials that do complete successfully still have trouble recruiting patients fast enough, so go way over-schedule and over-budget. But the order of magnitude should be close. See the model for more details.
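The back-of-the-envelope sizing above works out as follows. This only restates the rough figures quoted in this post (about a thousand new trials a year, roughly 10% terminated for lack of patients, $30M-$40M per trial) and assumes a terminated trial incurs the full average cost, which likely overstates things.

```python
trials_per_year = 1000          # rough figure from the text
share_terminated = 0.10         # roughly 10% fail to recruit enough patients
cost_low, cost_high = 30e6, 40e6  # average cost per trial, in dollars

terminated_trials = trials_per_year * share_terminated
wasted_low = terminated_trials * cost_low
wasted_high = terminated_trials * cost_high

print(f"About {terminated_trials:.0f} terminated trials per year")
print(f"Roughly ${wasted_low / 1e9:.0f}B to ${wasted_high / 1e9:.0f}B spent on them")
```

So the order of magnitude is a few billion dollars a year under these assumptions.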
Questioning assumptions, asking why
The problem was framed so that solutions tackling recruitment first came to mind e.g. increasing patient awareness of clinical trials through tools, advertising, etc., connecting patients to clinical trials automatically by leveraging EMR data.
But I wanted to understand the problem at a deeper level, vs. taking things at face value. I put together a simple model in Google Sheets and let the numbers shed some light on the problem. Interestingly, even if all clinical trials were able to recruit enough patients with a wave of a magical wand, the patient recruitment rate would only increase by 4%, much less than the competition’s desired 100% increase, or doubling, of the patient recruitment rate. This suggests that if we really want to accelerate medical research and get more of the patient population to participate in clinical trials, we’re not only going to need to recruit patients better, but we’ll also need a lot more clinical trials, clinical trials that happen faster and more efficiently.
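To see how far apart those two numbers are, here is the arithmetic, reading the 4% as a relative increase on the 3% baseline (the same way the competition's doubling is a 100% relative increase).

```python
current_rate = 0.03   # patient recruitment rate today
target_rate = 0.06    # the competition's goal (a doubling)

relative_gain = 0.04  # gain if every trial recruited enough patients
rate_after_fix = current_rate * (1 + relative_gain)

remaining_multiple = target_rate / rate_after_fix

print(f"Rate after fixing recruitment: {rate_after_fix:.2%}")
print(f"Still needs to grow {remaining_multiple:.2f}x to reach {target_rate:.0%}")
```

Even with perfect recruitment the rate would only move from 3% to about 3.12%, which is why more (and faster) trials are needed to get anywhere near 6%.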
I wrote a proposal for the competition, submitted it, and…
What idea did I submit?
An idea for a SaaS product that would mine/learn from all the data we have on previous clinical trials (a lot of it public), and help pharmaceutical companies and investigators learn from the past. This product would essentially be a search engine on top of a “similarity graph”, where pharma and/or doctors/investigators could describe their clinical trial, and see other trials that were similar in some way (perhaps disease treated, or similar inclusion/exclusion criteria), and learn from what made those clinical trials succeed or fail.
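One simple way to picture the “similarity” part is a set-overlap score between trials. This is only a sketch of the concept: the proposal didn't specify a similarity measure, so Jaccard similarity, the trial names, and the tags below are all assumptions for illustration.

```python
def jaccard(a, b):
    """Share of tags two trials have in common (0 = nothing, 1 = identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def most_similar(query_tags, trials):
    """Rank past trials by similarity to the described trial."""
    scored = [(name, jaccard(query_tags, tags)) for name, tags in trials.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical past trials, each described by a few tags.
past_trials = {
    "trial_a": {"diabetes", "adults", "metformin"},
    "trial_b": {"diabetes", "insulin"},
    "trial_c": {"asthma", "children"},
}
ranked = most_similar({"diabetes", "adults", "exercise"}, past_trials)
```

The top results would be the trials worth studying for what helped or hurt their recruitment.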
Why did I submit that?
There’s a lot of data out there on clinical trials, even publicly available data like clinicaltrials.gov. There has to be some sort of knowledge we can learn from all the clinical trials we’ve already conducted, from both the successes and failures.
Clinical trials face many different obstacles to recruiting patients, mostly because they themselves are very different–different populations, different diseases, different treatments, different investigators running the trial, different locations. But this doesn’t mean that trials aren’t similar to other trials in some way, so something that worked for one trial could also work for another, depending on how they’re similar.
As mentioned before, I realized that the actual clinical trial process needs to be faster, more efficient, and cheaper to drive a meaningful acceleration of medical research. This was a tool that pharma and investigators/doctors could use to both plan and run a clinical trial more efficiently.
My idea didn’t win any of the prizes for the competition, but that’s ok.
If interested, you can see the winning entries (as well as the “top 10”, not sure where all the other entries went).
Getting out of the office
I asked for feedback on how my entry was judged, but didn’t get anything back. Still following my curiosity for the problem, I decided to talk to more people actually involved in clinical trials–I had originally found out about the competition two weeks before the deadline, so given some more time I felt I could come up with something more useful.
I developed a script to scrape clinicaltrials.gov for investigator contact info, and was able to gather a good list of physicians in the NYC area. I also used Mechanical Turk to fill in what I wasn’t able to scrape, such as a doctor’s research institution. After writing a bunch of emails to request to meet, one doctor actually got back to me! After that it was a bit easier, as I would ask the doctors if they knew anyone else I could talk to, and also name-drop the institutions I had visited already. I got to speak to a couple ex-pharma individuals from this effort too.
The two biggest things I learned from speaking to the handful of physicians and ex-pharma folk:
Physicians don’t really talk to and learn from each other when it comes to clinical trials, e.g. about patient recruitment best practices. They’re extremely busy, and there isn’t really an incentive to help another physician who may be seen as a “competitor” (both in terms of revenue and research).
Though investigators (physicians) recruit patients for a clinical trial, pharma and “contract research organizations” (CROs) recruit the investigators to run a clinical trial (among a ton of other stuff to set up and support the trial). It seemed that industry’s methods for investigator selection were pretty manual: they would rely on their own personal, immediate networks, maybe look at which investigators they worked with in the past.
Building something fast
I decided to build an MVP that was based on my learnings. There’s a lot that can be improved in the clinical trials process, so I thought about leverage, and a decision tree: decisions made earlier in a process can have a big impact on the decisions made later. This early task of “investigator selection” that pharma does when setting up a clinical trial (point 2) sounded like a good one to try and tackle with technology. It also isn’t something that investigators themselves are super concerned with, which would get around the obstacles discovered in point 1. There’s a lot of public data out there on clinical trials (clinicaltrials.gov) and research that came out of the trials (PubMed), so I wanted my tool to leverage this data.
I threw together something really quickly using Flask, the Python framework. Use cases: pharma could type in a drug and find the researchers who published the most research on that drug–those physicians might be good candidates as investigators for a clinical trial that used that drug (to perhaps treat a different disease). Patients could type in the disease they had and find the physicians who were perhaps the most knowledgeable on that disease. On the backend, data was scraped from PubMed, and essentially just restructured to be more useful for this particular case.
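The core of the drug lookup can be sketched as a simple count of publications per author. This is an illustration rather than the prototype's actual code: the record structure, field names, and sample data are made up, and the scraping and Flask layers are left out.

```python
from collections import Counter

def top_researchers(publications, drug, limit=3):
    """Rank authors by how many publications mention the given drug."""
    counts = Counter()
    for pub in publications:
        if drug.lower() in (d.lower() for d in pub["drugs"]):
            counts.update(pub["authors"])
    return counts.most_common(limit)

# Hypothetical records, standing in for data restructured from PubMed.
publications = [
    {"drugs": ["metformin"], "authors": ["Dr. A", "Dr. B"]},
    {"drugs": ["metformin", "insulin"], "authors": ["Dr. A"]},
    {"drugs": ["insulin"], "authors": ["Dr. C"]},
]
ranking = top_researchers(publications, "Metformin")
```

The disease lookup for patients would work the same way, keyed on a disease field instead of a drug field.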
I started showing the “Clinical Research Network” to people in the biotech space to see what they thought…
…and I quickly found out that several companies, both small and large, were tackling this exact problem. They had way better credentials, more money, and free snacks at the office–how can I compete with free snacks?
So I put this project on hold, mulled over the possibility of working for them, and decided to move onto other ideas I was thinking about. I like writing post-mortems for my projects, and one of the biggest learnings was that I seemed to have “overextended” myself in a sense: I felt like my struggle was a very steep uphill climb from the beginning because I didn’t have the industry credentials and I didn’t yet have the industry network, very important aspects in an industry like biotech and healthcare.
Overall, the project was a great learning experience, and I got to practice several problem solving skills I find powerful and fun.
My friend Jesse introduced me to the Open Payments Dataset, which tracks the details of all payments made by “applicable” healthcare manufacturers (like pharmaceutical companies, medical device manufacturers) to any doctor they work with. A federal program maintains this database, which is a product of the Sunshine Act, part of the Affordable Care Act.
Why does this database exist? Basically because of the incentives created by industry being able to pay doctors to work on things that will ultimately help industry–like new drugs or medical devices. The hope is that more transparency will reduce any harmful influence that industry could have on medical research, education, and clinical decision making. In the words of Senator Grassley, co-author of the Sunshine Act:
Disclosure brings about accountability, and accountability will strengthen the credibility of medical research, the marketing of ideas and, ultimately, the practice of medicine. The lack of transparency regarding payments made by the pharmaceutical and medical device community to physicians has created a culture that this law should begin to change substantially. The reform represented in the Grassley-Kohl Sunshine Law is in patients’ best interest.
The healthcare industry pays physicians a lot, almost $6.5B in 2014 alone. What is being paid for though (or, what does industry report the payments are for)? Who’s getting paid, and how much? I decided to do a quick analysis to start answering these questions and to see if there was anything interesting at a high level.
Most top paid physicians get paid royalties or license fees
The most a single physician got paid in 2014 was almost $44M. The interesting thing is that for this physician and several other top paid physicians, almost the entire total came from payments that were categorized in this unhelpfully-named category, “Compensation for services other than consulting, including serving as faculty or as a speaker at a venue other than a continuing education program” (orange).
A large majority of the other top paid physicians got paid primarily from “Royalty or License” (green), which makes sense: a surgeon may invent a new surgical technique and license it to a medical device company.
Another interesting phenomenon is that a handful of doctors in the top 100 earners were paid by industry solely for their research (purple). The status quo of industry having all the money and thus paying/funding research–sometimes both the design of and execution of the research–can create incentives with negative consequences for the validity of the results.
You can play around with the charts like the one below by zooming, mousing over data points to see their values, and showing/hiding different data series by clicking on each one in the legend. Physician names have been replaced with numbers for anonymity.
Orthopedic surgeons received the most industry payments, followed by cardiovascular physicians
Orthopedic surgeons received the most money from industry, almost twice the amount that cardiovascular physicians received, in 2014. Interestingly, most of the payments to orthopedic surgeons, and other types of surgeons, were for royalties or licenses (green), whereas most payments for physicians–cardiovascular and otherwise–were for “Compensation for services other than consulting” (orange), “Research” (purple), and “Consulting” (purple).
Click to show interactive chart (some labels are crazy long so embedding didn’t look good. “A&O” stands for “Allopathic & Osteopathic Physicians”):
The healthcare industry pays a lot of money for research
Out of the $6.5B total payments to physicians in 2014, $3.2B, or almost half, of those payments were for research. We can see this when aggregating the payments by the name of the drug or device manufacturer: companies like Genentech, Pfizer, and Novartis dominate the dollar amount of payments made to physicians, and most of their payments are for “Research” (brown). Further down the line, you can see medical device manufacturers like Stryker and Medtronic paying physicians mostly for “Royalty and License” (green).
Click to show interactive chart:
Physicians in CA received, by far, the most money from industry.
The graph below shows how much money physicians received for research and “general” payments (any payment that isn’t classified as “Research”), grouped by the state they work in; the size of each bubble represents the number of physicians in that state.
CA had significantly more physicians receive payments (8081) than the runner-up state, NY (5981), and thus the physicians that worked in CA received a lot more money from industry, in aggregate.
Though drilling into state by state differences in the data (e.g. the dominant “purpose” CA physicians vs. physicians in other states get paid for) is an exercise for another time, we get a hint for why this phenomenon might exist by looking at the teaching hospitals that were affiliated with the physicians who got paid by industry the most.
Do physicians get rewarded with fancy dinners and extravagant trips?
By looking at the data, we can find which physicians got paid the most for “Entertainment”, “Food and Beverage”, and “Travel and Lodging”. But we won’t know for sure, because remember, all this payment data is reported by the healthcare industry themselves, and while there are some financial penalties for inaccurate reports, I don’t see an easy way for the government to verify the validity of the data.
The “worst offenders” were essentially given, by industry, $60 meals three times a day for every day of the year, went on $590 per day trips, and spent $43 a day (about $300 a week) for entertainment and fun. Sounds like the life (except a little more on the entertainment and fun please).
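For a sense of scale, here is what those per-day figures imply over a full year. The per-day amounts are the ones quoted above; the annual totals are derived here by assuming the spending really did happen every single day, so treat them as rough.

```python
DAYS_PER_YEAR = 365

meal_cost, meals_per_day = 60, 3
travel_per_day = 590
entertainment_per_day = 43

food_per_year = meal_cost * meals_per_day * DAYS_PER_YEAR
travel_per_year = travel_per_day * DAYS_PER_YEAR
entertainment_per_year = entertainment_per_day * DAYS_PER_YEAR
entertainment_per_week = entertainment_per_day * 7

print(f"Food: ${food_per_year:,} per year")
print(f"Travel: ${travel_per_year:,} per year")
print(f"Entertainment: ${entertainment_per_year:,} per year "
      f"(${entertainment_per_week} per week)")
```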
There’s a lot of money being transferred from the healthcare industry to physicians, which means a ton of data since all of this has to be reported now. In fact, I didn’t even touch another part of the dataset, how much ownership each physician has in a particular drug or device manufacturer, which could give even more color on misaligned incentives. Also, without aggregation of some of the data fields, the raw, transaction/payment level data took up close to 6GB of space, and I didn’t want to spin up a Spark cluster or something. Luckily, the Open Payments site provides a web service that allowed me to aggregate and filter the raw data, dramatically reducing the dataset’s size.
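For anyone who does want to work from the raw file, one alternative to a cluster is to stream it row by row and keep only running totals in memory. This is a sketch of that alternative, not what this analysis did (the aggregation here came from the web service). The column names are hypothetical and the tiny in-memory CSV stands in for the real download.

```python
import csv
import io
from collections import defaultdict

def total_by_state(rows, state_field="recipient_state", amount_field="payment_amount"):
    """Sum payment amounts per state, reading one row at a time."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[state_field]] += float(row[amount_field])
    return dict(totals)

# Stand-in for the multi-gigabyte file; with a real file, pass open(path, newline="").
sample_csv = io.StringIO(
    "recipient_state,payment_amount\n"
    "CA,100.00\n"
    "NY,75.50\n"
    "CA,250.00\n"
)
state_totals = total_by_state(csv.DictReader(sample_csv))
```

Because `csv.DictReader` yields rows lazily, memory use stays proportional to the number of states rather than the number of payments.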