Following my previous post on judging the serverlessness of a technology, I apply this criterion to AWS Lambda. I argue that the timeout and memory size configuration parameters are non-essential and should be made optional. The need to think about them makes Lambda less serverless than it could be.
You naturally write a function so that it finishes as soon as possible. It’s just good engineering and good for business. Why then artificially limit its execution time?
The most common case I hear about using timeout is when a Lambda calls some external API. In this scenario, it is used as a fail-safe in case the API takes too long to respond. A better approach is to implement a timeout on the API call itself, in code, and fail the Lambda gracefully if it does not respond in time instead of relying on the runtime to terminate your function. That’s also good engineering.
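A minimal sketch of that approach, using only the Python standard library. The `url` event key, the 5-second budget and the shape of the returned dictionary are assumptions made for illustration, not part of any Lambda contract.

```python
import socket
import urllib.error
import urllib.request

API_TIMEOUT_SECONDS = 5  # hypothetical budget for the external call


def fetch(url, timeout=API_TIMEOUT_SECONDS):
    """Call the external API, giving up after `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def handler(event, context, fetch=fetch):
    """Lambda entry point that fails gracefully when the API is too slow."""
    try:
        body = fetch(event["url"])  # "url" is an assumed event field
    except (socket.timeout, urllib.error.URLError) as exc:
        # Timeouts and connection failures end up here, so the function
        # decides how to fail instead of being killed by the runtime.
        return {"ok": False, "error": f"external API unavailable: {exc}"}
    return {"ok": True, "body": body}
```

Passing `fetch` as a parameter is only there to make the handler easy to exercise without a network connection.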
So here’s my first #awswishlist entry: Make timeout optional and let functions run as long as they need to.
On memory size
I have two issues with the memory size parameter.
First of all, it’s a leaky abstraction of the underlying system. You don’t just specify how much memory your function gets, but also the CPU power. There’s a threshold where the Lambda container is assigned 2 vCPUs instead of 1. Last time I checked this was at 1024 MB, but there’s no way of knowing this unless you experiment with the platform. Since Lambda does not offer specialized CPU instances like EC2 does (yet?), it might not matter, but I worked on a data processing application where this came into play. Why not allow us to configure this directly? What if I need less memory but more vCPUs?
However, a more serious point of contention for me is that setting the memory size is an issue of capacity planning. That’s something that should have gone away in the serverless world. You have to set it for the worst possible scenario as there’s no “auto-scaling”. It really sucks when your application starts failing because a Lambda function suddenly needs 135 MB of memory to finish.
Hence here’s my second #awswishlist entry: Make memory size optional. Or provide “burst capacity” for those times a Lambda crosses the threshold.
Now I won’t pretend I understand all the complexities that are behind operating the Lambda platform and I imagine this is an impossible request, but one can dream.
And while I’m at it, a third #awswishlist item is: Publish memory consumed by a Lambda function as a metric to CloudWatch.
I do see value in setting either of these parameters, but I think those are specialized cases. For the vast majority of code deployed on Lambda, the platform should take care of “just doing the right thing” and allow us, developers, to think less about the ops side.
Configuring workload management (WLM) for a Redshift cluster is one of the most impactful things you can do to improve the overall performance of your queries.
The goal is, roughly speaking, to have as few slots per queue as possible with as little wait time in each queue as possible, ideally none. This ensures that queries have the most memory available (which helps with query execution speed, as intermediate results don’t have to be written to disk) while, at the same time, they execute immediately.
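To make the trade-off concrete, here is a small arithmetic sketch. It assumes that a queue’s memory is split evenly among its slots; the 30,000 MB queue size is a made-up figure, not something read from a real cluster.

```python
def memory_per_slot_mb(queue_memory_mb, concurrency_level):
    """Memory available to each slot when a queue's memory is split evenly."""
    if concurrency_level < 1:
        raise ValueError("concurrency_level must be at least 1")
    return queue_memory_mb / concurrency_level


QUEUE_MEMORY_MB = 30_000  # hypothetical amount of memory assigned to one queue

for level in (5, 10, 15):
    # More slots mean less waiting, but each query gets a smaller share of memory.
    print(level, memory_per_slot_mb(QUEUE_MEMORY_MB, level))
```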
There’s no golden rule on how to configure WLM queues, as it is really use-case specific. I recommend starting very simple. By default, there’s a single queue with a concurrency level of 5. This is most probably insufficient: queries won’t be executed immediately, but will be waiting for a slot to free up. Increase it (say, to 15) and monitor the wait time over the next few days.
You can use the v_check_wlm_query_trend_hourly admin view from the tremendously useful amazon-redshift-utils and plot it on a graph. You are only interested in those with a service_class > 5, as the first five are internal and you cannot change their configuration.
In the graph above you can see that there’s pretty much no wait time on the queue, which is a good thing. In such a case you can experiment with reducing the concurrency level to increase the memory-per-slot of a queue. Use this query to inspect the memory allocation and concurrency level of your queues:
SELECT service_class, query_working_mem AS mem_mb_per_slot, num_query_tasks AS concurrency_level
FROM stv_wlm_service_class_config
WHERE service_class > 5
ORDER BY 1;
Finally, make sure to set a query timeout (maximum time it can run) on your WLM queues. A runaway query can bring your cluster to a halt.
Figuring out the sweet spot for your WLM setup takes a while and you should revisit it regularly as your system evolves. The great thing about changing WLM config is that tweaking the properties of a queue does not require a cluster reboot so you won’t disrupt the work of your colleagues by experimenting with the setup.
There are a lot of fine-grained parameters you can adjust and tons more to learn about WLM (my favourite gem is wlm_query_slot_count). Yet even a very basic setup will help with the overall cluster performance. It is absolutely worth the effort to understand and implement WLM.
In my day job, we’re using Lambda and Step Functions to create data processing pipelines. This combo works great for a lot of our use cases. However, for some specific long-running tasks (e.g. web scrapers), we “outsource” the computing from Lambda to Fargate.
This poses an issue: how to plug that part of the pipeline into the Step Function orchestrating it. Using an Activity does not work when the processing is distributed among multiple workers.
A solution I came up with is creating a gatekeeper loop in the Step Function, in which a Lambda function oversees the progress of the workers. This is how it looks:
The gatekeeper function (triggered by the GatekeeperState) checks whether the external workers have finished yet. This can be done by waiting until an SQS queue is empty, counting the number of objects in an S3 bucket, or any other way of indicating that the processing can move on to the next state.
If the processing is not done yet, the gatekeeper function raises a NotReadyError. This is caught by the Retry block in the Step Function, pausing the execution for a certain period of time, as defined by its parameters. Afterwards, the gatekeeper is called again.
Eventually, if the work is not done even after MaxAttempts retries, the ForceGatekeeperState is triggered. It adds a "force: true" parameter to the invocation event and calls the gatekeeper right back again. Notice that the gatekeeper function checks for this force parameter as the very first thing when executed. Since it’s present from the ForceGatekeeperState, it returns immediately and the Step Function moves on to the DoneState.
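The gatekeeper described above could be sketched roughly like this. Only NotReadyError and the force parameter come from the description; the `pending` event field, the `workers_finished` helper and the return values are hypothetical stand-ins for a real check against SQS or S3.

```python
class NotReadyError(Exception):
    """Raised while the external workers are still busy."""


def workers_finished(event):
    """Hypothetical check; replace with an SQS queue depth or S3 object count lookup."""
    return event.get("pending", 0) == 0


def handler(event, context, is_done=workers_finished):
    # The force flag is checked first, so a forced run returns immediately.
    if event.get("force"):
        return {"status": "forced"}
    if not is_done(event):
        # For Python functions the exception class name is the error name
        # that the Retry block of the state machine matches on.
        raise NotReadyError("workers are still processing")
    return {"status": "done"}
```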
For our use case, it was better to have partial results than no results at all. That’s why the ForceGatekeeperState is present. You can also leave it out altogether and have the Step Function execution fail after MaxAttempts retries of the gatekeeper.
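For illustration, here is one way the loop might be expressed in the Amazon States Language, built as a Python dictionary. The state names follow the text; the placeholder function ARN, the retry interval, the attempt count and the use of a Pass state to inject the force flag are assumptions, not necessarily how the original state machine was written.

```python
import json

# Placeholder ARN; substitute the real gatekeeper function.
GATEKEEPER_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:gatekeeper"


def build_definition(interval_seconds=60, max_attempts=10, allow_partial_results=True):
    """Return a state machine definition containing the gatekeeper loop."""
    gatekeeper = {
        "Type": "Task",
        "Resource": GATEKEEPER_ARN,
        # Wait and call the gatekeeper again while the workers are busy.
        "Retry": [{
            "ErrorEquals": ["NotReadyError"],
            "IntervalSeconds": interval_seconds,
            "MaxAttempts": max_attempts,
            "BackoffRate": 1.0,
        }],
        "Next": "DoneState",
    }
    states = {"GatekeeperState": gatekeeper, "DoneState": {"Type": "Succeed"}}
    if allow_partial_results:
        # Once the retries are exhausted, fall through to the forcing state.
        gatekeeper["Catch"] = [{
            "ErrorEquals": ["NotReadyError"],
            "ResultPath": "$.error",
            "Next": "ForceGatekeeperState",
        }]
        # Adds force: true to the input and calls the gatekeeper right back.
        states["ForceGatekeeperState"] = {
            "Type": "Pass",
            "Result": True,
            "ResultPath": "$.force",
            "Next": "GatekeeperState",
        }
    return {"StartAt": "GatekeeperState", "States": states}


print(json.dumps(build_definition(), indent=2))
```

Calling `build_definition(allow_partial_results=False)` drops the Catch block and the forcing state, so the execution fails once the retries run out.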
The default way of creating a zip package that’s to be deployed to AWS Lambda is to place everything (your source code and any libraries you are using) in the service root directory and compress it. I don’t like this approach: due to the flat hierarchy, it can lead to naming conflicts, it makes packaging of isolated functions harder to manage, and it creates a mess in the source directory.
What I do instead is install all dependencies into a lib directory (which is as simple as a pip install -r requirements.txt -t lib step in the deployment pipeline) and set the PYTHONPATH environment variable to /var/runtime:/var/task/lib when deploying the Lambda functions.
This works because the zip package is extracted into /var/task in the Lambda container. While it might seem like an unstable solution, I’ve been using this for over a year now without any problems.
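The mechanism can be tried locally with a small sketch. A temporary directory plays the role of /var/task, and the `vendored` module is a made-up stand-in for a pip-installed dependency; none of this touches Lambda itself.

```python
import os
import subprocess
import sys
import tempfile


def import_from_lib(task_root):
    """Run a fresh interpreter with PYTHONPATH pointing at <task_root>/lib."""
    env = dict(os.environ, PYTHONPATH=os.path.join(task_root, "lib"))
    result = subprocess.run(
        [sys.executable, "-c", "import vendored; print(vendored.NAME)"],
        env=env, cwd=task_root, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def demo():
    """Create a throwaway package layout and import a module from its lib directory."""
    with tempfile.TemporaryDirectory() as task_root:
        lib_dir = os.path.join(task_root, "lib")
        os.makedirs(lib_dir)
        # Stand-in for a dependency installed with: pip install ... -t lib
        with open(os.path.join(lib_dir, "vendored.py"), "w") as handle:
            handle.write("NAME = 'loaded from lib'\n")
        return import_from_lib(task_root)


print(demo())
```

The module lives only in the lib subdirectory, so the import succeeds purely because of the PYTHONPATH entry.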
TL;DR: I’m open-sourcing a continuous deployment pipeline built for AWS to automate the process of creating and deploying AWS Lambda functions and related infrastructure.
Because of my tinkering with Alexa, I wanted to have an automated way of deploying a new version of a Lambda function just via git push. Doing it manually is cumbersome. As of late, AWS offers all the tools necessary to do so. Their Code* family of services (CodeCommit, CodeBuild & CodePipeline) are the perfect building blocks to set up this process.
Furthermore, I also wanted to automate the necessary infrastructure and treat it as code. That’s where CloudFormation comes in. I didn’t have any prior knowledge of CloudFormation, so it was a great learning experience. I used this excellent template as a starting point and I want to thank the guys over at Cloudonaut for publishing it. Still, it took me a lot of time to grasp all the concepts of CFN and I went through a lot of trial and error to figure out how everything ties together.
In the end, I’m very happy with the result. This initial version is quite basic, but it works well. What makes it cool is that the pipeline is self-referencing, so any changes you make to it get automatically applied. You can read the details about how it works in the README.
I will be expanding its functionality, so feel free to star the repo on GitHub and follow along.