I had my first encounter of a real world race condition today while fixing a bug for Discourse. For some reason, users were not getting a notification when an automated moderator post is added to the topic. After tracing through the source code, I found the method that was responsible for adding the automated post:

def add_moderator_post(user, text, opts=nil)
  opts ||= {}  
  new_post = nil
  Topic.transaction do
    creator = PostCreator.new(user,
                              raw: text,
                              post_type: opts[:post_type] || Post.types[:moderator_action],
                              action_code: opts[:action_code],
                              no_bump: opts[:bump].blank?,
                              skip_notifications: opts[:skip_notifications],
                              topic_id: self.id,
                              skip_validations: true)
    new_post = creator.create
    increment!(:moderator_posts_count)
  end

  # ...
end

We can see that Discourse uses a PostCreator class to handle the logic of creating a new post and that was being done in Topic.transaction. After creating the post, a counter for the number of moderator posts is increased by one. The next step was to look at PostCreator#create and find out what is happening in there:

  def create
    if valid?
      transaction do
        # ...
        save_post
        # ...
      end
    end

    if @post && errors.blank?
      # ... 
      enqueue_jobs
      # ...
    end

    # ...

    @post
  end

We can see that enqueue_jobs is called only after the post has been saved within the transaction. Within the job, the execute method takes in the post_id and retrieves the post from the database.

module Jobs
  class PostAlert < Jobs::Base

    def execute(args)
      # maybe it was removed by the time we are making the post
      if post = Post.find_by(id: args[:post_id])
        # maybe the topic was deleted, so skip in that case as well
        PostAlerter.post_created(post) if post.topic
      end
    end

  end
end

First thing I did was to check if the job was even executed at all. Checking Sidekiq's logs, I didn't notice any failing jobs so that wasn't too helpful. Since the job is being called, that means the condition in the job was not evaluating to true. To confirm my suspicion, I threw in a puts statement:

def execute(args)
  puts Post.find(args[:post_id])

  # maybe it was removed by the time we are making the post
  if post = Post.find_by(id: args[:post_id])
    # maybe the topic was deleted, so skip in that case as well
    PostAlerter.post_created(post) if post.topic
  end
end

Boom! ActiveRecord::RecordNotFound: Couldn't find Post with 'id'=752 [WHERE ("posts"."deleted_at" IS NULL)]. I didn't understand what was happening. The record should have been saved and the Sidekiq job should be able to query for it. I made a wild guess and thought it was a timing issue (I was totally wrong on this) and decided to throw in a sleep(10):

def execute(args)
  sleep(10)
  puts Post.find(args[:post_id])

  # maybe it was removed by the time we are making the post
  if post = Post.find_by(id: args[:post_id])
    # maybe the topic was deleted, so skip in that case as well
    PostAlerter.post_created(post) if post.topic
  end
end

It finally worked but I still didn't know the root cause. However, I did make an adjustment by changing Post.find_by to Post.find. The difference is that Post.find_by returns nil if it doesn't find any record which means the job is silently ignored. With Post.find, an exception would have been raised and Sidekiq would retry the job at a later time.

After tracing through the code a couple more times, I finally noticed that we had two transactions going on with one being nested in the other. The code was essentially something like this:

def add_moderator_post(user, text, opts=nil)
  # ...
  Topic.transaction do
    creator = PostCreator.new(user,
                              raw: text,
                              post_type: opts[:post_type] || Post.types[:moderator_action],
                              action_code: opts[:action_code],
                              no_bump: opts[:bump].blank?,
                              skip_notifications: opts[:skip_notifications],
                              topic_id: self.id,
                              skip_validations: true)

    # Called by creator.create
    if valid?
      transaction do
        # ...
        save_post
        # ...
      end
    end

    if @post && errors.blank?
      # ... 
      enqueue_jobs
      # ...
    end

    increment!(:moderator_posts_count)
  end

  # ...
end

After realizing that we were enqueuing a job within a transaction, it finally all made sense. In order for a record to be persisted in the database, it has to be committed. In the case of having nested transactions, the newly created post was not committed into the database until Topic.transaction was done and we were enqueuing the job before Topic.transaction had finished whatever it had to within its block. Since is Sidekiq is so fast, it was able to execute the job even before the code within Topic.transaction has completed which explains why the post record could not be fetched. After understanding what was happening, the fix was pretty straight forward:

-    Topic.transaction do
-      creator = PostCreator.new(user,
-                                raw: text,
-                                post_type: opts[:post_type] || Post.types[:moderator_action],
-                                action_code: opts[:action_code],
-                                no_bump: opts[:bump].blank?,
-                                skip_notifications: opts[:skip_notifications],
-                                topic_id: self.id,
-                                skip_validations: true)
-      new_post = creator.create
-      increment!(:moderator_posts_count)
-    end
+    creator = PostCreator.new(user,
+                              raw: text,
+                              post_type: opts[:post_type] || Post.types[:moderator_action],
+                              action_code: opts[:action_code],
+                              no_bump: opts[:bump].blank?,
+                              skip_notifications: opts[:skip_notifications],
+                              topic_id: self.id,
+                              skip_validations: true)
+    new_post = creator.create
+    increment!(:moderator_posts_count) if new_post.persisted?

I hope you learned something today as I have and feel free to ask me any questions in the comments below :)