Blocking Sensitive Content using Nginx and Docker

I'm smart enough to know that I'm dumb. - Richard Feynman

21 June 2018

Introduction

Web application firewalls (WAFs) are often deployed by security professionals to protect applications against malicious attacks. Some of these, such as the popular open-source ModSecurity, can inspect both the incoming request and the outgoing response, so they can detect web attacks as well as information leakage. There are also cloud-based WAFs, such as those from Cloudflare and Sucuri, that make it easy to protect a web application or website.

Not all web application firewalls offer outgoing response inspection. Some WAFs focus solely on analyzing incoming requests to stop an attack before it can reach the application. This article shows how to build a simple Nginx module that inspects the outgoing response body for sensitive data and blocks the response. The module uses the PCRE regular expression library to inspect content and is based on a fork of Weibin Yao's nginx substitution filter.

This module can be useful as an additional layer of defense against web attacks. It can complement a WAF that only analyzes incoming requests. In this article, the module will be compiled into Nginx and packaged as a Docker image.

Article last updated Nov 2020.

Design and Approach

This section gives an overview of how the content filter module is designed and the way that it can be used to block sensitive content. It briefly explains Weibin Yao's substitution module and the differences between his original module and the forked content filter.

Weibin Yao's substitution module matches specific content in the HTTP response body, using either regular expressions or fixed strings. It can replace these matches with specific values. This replacement functionality can already be used to "block" sensitive information. For example, a regular expression can match a Singapore identity card number (NRIC) and replace it with a single blank space.

However, it may be more convenient to prevent an entire HTML page from being displayed if it contains sensitive identity card numbers. A forked version of Weibin Yao's original substitution module can be modified to do this.

Module Setup Diagram

The following diagram illustrates one of the ways that this content filter can be used to detect and block outbound sensitive information.

Fig 1. Nginx Reverse Proxy to Filter and Block Sensitive Content

Nginx is compiled with the content filter module and runs as a reverse proxy in front of a web application. It inspects the outgoing content from the web application using regular expressions. If a specified number of matches for sensitive data occurs, the content is blocked and Nginx displays an empty page instead of the original response.

Besides a reverse proxy setup, Nginx can also be configured as a webserver directly with the content filter enabled. The web content served by Nginx will pass through the content filter. If sensitive information is detected, the content filter can send an empty page instead of the original data.

Weibin Yao's Substitution Module

This section requires some knowledge of nginx module internals. Refer to an earlier article, Writing an Nginx Response Body Filter Module, for a quick introduction on how to code a simple Nginx filter. There are also links to other resources for developing Nginx modules at the end of the article.

Nginx uses a chain of buffers to store outgoing response data. This chain of buffers can be passed to third party filters for additional processing before being sent to the user.

Weibin Yao's substitution module processes each of the outgoing response buffers, looking for a linefeed character (\n). When a linefeed is found, the characters up to and including the linefeed are stored in a buffer variable, ctx->line_in. Matching and substitution are then performed on ctx->line_in, and the resulting string with its replacements is copied to another buffer variable, ctx->line_dst.

ctx->line_dst is itself eventually copied to ctx->out_buf. ctx->out_buf is the last buffer in the ctx->out chain of buffers.

Weibin Yao's module holds back (output buffers) an HTTP response until it has been matched and substituted. Lines that have been matched and substituted can be sent, while the rest of the content is held back pending regular expression matching. ctx->out stores a chain of buffers containing the matched and substituted response data that will eventually be sent to the user.

Whenever new output storage is needed, a function creates a new nginx buffer chain structure and a new buffer structure. The buffer structure is part of the chain structure. This chain struct is then appended to ctx->out. The buffer structure is assigned to ctx->out_buf.
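The append step described above can be sketched in plain C. This is a simplified, standalone illustration and not the module's code: buf_t, chain_t, out_ctx_t, out_init(), out_append() and out_total() are hypothetical stand-ins for ngx_buf_t, ngx_chain_t and the relevant context fields, and malloc() stands in for nginx pool allocation.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for ngx_buf_t: a window over some bytes. */
typedef struct buf_s {
    const char *pos;   /* start of the data */
    const char *last;  /* one past the end of the data */
} buf_t;

/* Simplified stand-in for ngx_chain_t: a singly linked list of buffers. */
typedef struct chain_s {
    buf_t          *buf;
    struct chain_s *next;
} chain_t;

typedef struct {
    chain_t  *out;       /* head of the output chain */
    chain_t **last_out;  /* address of the tail link's next pointer */
    buf_t    *out_buf;   /* buffer of the most recently appended link */
} out_ctx_t;

static void out_init(out_ctx_t *ctx)
{
    ctx->out = NULL;
    ctx->last_out = &ctx->out;  /* empty chain: the tail slot is the head */
    ctx->out_buf = NULL;
}

/* Create a chain link with its buffer and append it in O(1). */
static int out_append(out_ctx_t *ctx, const char *data, size_t len)
{
    chain_t *cl = malloc(sizeof(*cl));
    buf_t   *b = malloc(sizeof(*b));

    if (cl == NULL || b == NULL) {
        free(cl);
        free(b);
        return -1;
    }

    b->pos = data;
    b->last = data + len;
    cl->buf = b;
    cl->next = NULL;

    *ctx->last_out = cl;        /* hook the link onto the current tail */
    ctx->last_out = &cl->next;  /* the new tail slot is this link's next */
    ctx->out_buf = b;           /* remember the last output buffer */
    return 0;
}

/* Total number of bytes held in the output chain. */
static size_t out_total(const out_ctx_t *ctx)
{
    size_t total = 0;

    for (const chain_t *cl = ctx->out; cl != NULL; cl = cl->next) {
        total += (size_t)(cl->buf->last - cl->buf->pos);
    }
    return total;
}
```

Keeping a pointer to the tail's next slot means appending never has to walk the chain, which is presumably why the context keeps such a pointer alongside the chain head.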

When the lines in the response body content have been processed, the ctx->out chain is passed along to the next filter in Nginx. This modified content will eventually be delivered to the user after it has cleared all other nginx filters.

Content Filter Module

The new content filter module will retain most of the logic in Weibin Yao's module. However, it doesn't need to do any replacements or substitutions. Instead, it keeps a count of the number of matches per regular expression. Unlike the original substitution module, the content filter will not do fixed string matching. All matching will be done through the PCRE regular expression engine. The new module will only do case-insensitive comparisons.

If the number of matches for a particular regular expression equals or exceeds a specified threshold, the content is deemed to be sensitive and will be blocked. Blocking here means that an empty page will be sent instead of the original content.

The content filter tries to avoid using HTTP chunked transfer encoding. It sets a proper HTTP content length so that the user's browser knows up front how much data to expect. If other Nginx filters are enabled, for example gzip, they can still switch the transfer mode back to chunked encoding when it is their turn to process the output.

There is a log only mode available so that HTTP content is not blocked even if sensitive information is detected. If this mode is enabled, the content filter will only log an alert. This mode can be useful for troubleshooting.

The content filter will buffer the HTTP response until it has been processed. This output buffering is necessary as the filter will send a blank page instead of the original content if sensitive information is detected.

Regular expression comparisons can be computationally expensive. To bound this cost, the content filter has a defined maximum content size, NGX_HTTP_CT_MAX_CONTENT_SZ. This is set to 10MB by default. It can be changed in the module source code.

Static files that exceed this limit will not be processed by the content filter; an empty page will be sent to the user instead. For dynamic content (e.g. output generated by scripts), the content filter stops processing once the HTTP output grows beyond this limit and sends an empty page.

Just like Weibin Yao's original substitution module, the content filter does not handle compressed data; compressed data is allowed to pass through. In a reverse proxy setup, the upstream web application should therefore not compress HTTP content using deflate or gzip. The Nginx proxy itself, though, can be configured with gzip compression, because in Nginx the gzip module can run after the content filter has examined the data.

Regular expressions are mostly used to match textual data rather than binary data. Weibin Yao's module has a directive for defining the content types that will be processed, for example text/plain, application/javascript or text/css. The content filter retains this feature. The default is to match text/html.

Some of the original code in the substitution module has been refactored to make it clearer and easier to understand.

Implementation

This section will run through parts of the content filter source code. The full source code is available from the Github link at the bottom of the article. It is assumed that the reader understands how a basic Nginx module is structured and how a module works. Refer to the links at the end of the article for resources on how to develop an Nginx module.

Configuration Directives

The following snippet shows the configuration directives that the content filter accepts.

static ngx_command_t  ngx_http_ct_filter_commands[] = {

    { ngx_string("ct_filter"),
      NGX_HTTP_LOC_CONF|NGX_CONF_TAKE2,
      ngx_http_ct_filter,
      NGX_HTTP_LOC_CONF_OFFSET,
      0,
      NULL },

    { ngx_string("ct_filter_logonly"),
      NGX_HTTP_MAIN_CONF|NGX_HTTP_SRV_CONF|NGX_HTTP_LOC_CONF|NGX_CONF_FLAG,
      ngx_conf_set_flag_slot,
      NGX_HTTP_LOC_CONF_OFFSET,
      offsetof(ngx_http_ct_loc_conf_t,logonly),
      NULL },

    { ngx_string("ct_filter_types"),
      NGX_HTTP_MAIN_CONF|NGX_HTTP_SRV_CONF|NGX_HTTP_LOC_CONF|NGX_CONF_1MORE,
      ngx_http_types_slot,
      NGX_HTTP_LOC_CONF_OFFSET,
      offsetof(ngx_http_ct_loc_conf_t, types_keys),
      &ngx_http_html_default_types[0] },

    { ngx_string("ct_line_buffer_size"),
      NGX_HTTP_MAIN_CONF|NGX_HTTP_SRV_CONF|NGX_HTTP_LOC_CONF|NGX_CONF_TAKE1,
      ngx_conf_set_size_slot,
      NGX_HTTP_LOC_CONF_OFFSET,
      offsetof(ngx_http_ct_loc_conf_t, line_buffer_size),
      NULL },

    { ngx_string("ct_buffers"),
      NGX_HTTP_MAIN_CONF|NGX_HTTP_SRV_CONF|NGX_HTTP_LOC_CONF|NGX_CONF_TAKE2,
      ngx_conf_set_bufs_slot,
      NGX_HTTP_LOC_CONF_OFFSET,
      offsetof(ngx_http_ct_loc_conf_t, bufs),
      NULL },

    ngx_null_command
};

The ct_filter directive takes 2 arguments and can occur in the Nginx configuration location block. The first argument is the regular expression to compare against each line of the response body. The second is the threshold for the number of matches. If the number of matches for the entire response body equals or exceeds this threshold, the content is flagged as sensitive.

The ct_filter_logonly directive takes an on/off value and can occur in the main, server or location blocks of the Nginx configuration file. By default ct_filter_logonly is set to off. When this directive is set to "on", the module will not block sensitive content. It will only log that sensitive information has been detected. This option is useful when tuning the regular expression or troubleshooting issues.

The ct_filter_types directive specifies the MIME content types of the HTTP responses that the content filter will process. The default is text/html. Additional types such as text/plain or application/javascript can be specified so that the module will inspect these for sensitive information.

The remaining directives, ct_line_buffer_size and ct_buffers, are for tuning the module. ct_line_buffer_size specifies the initial size of the buffer used to store a line. The default is 8 x pagesize, which on most systems is 8 x 4096 (32768) bytes. ct_buffers specifies the number of buffers used by the module and the size of each buffer.
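As a rough illustration, the directives above might be combined in a reverse proxy configuration like the one below. This is only a sketch: the upstream address, the listening port, the NRIC-style pattern, the threshold of 5 and the buffer values are assumptions chosen for the example, not recommended settings.

```nginx
server {
    listen 8000;

    location / {
        # Hypothetical upstream web application
        proxy_pass http://127.0.0.1:8080;

        # Ask the upstream not to compress, since compressed
        # responses pass through the content filter unexamined
        proxy_set_header Accept-Encoding "";

        # Flag the response when an NRIC-like pattern
        # (illustrative only) matches 5 or more times
        ct_filter "[STFG][0-9]{7}[A-Z]" 5;

        # Log detections without blocking while tuning the pattern
        ct_filter_logonly on;

        # Inspect text/plain responses as well
        ct_filter_types text/plain;

        # Tuning: initial line buffer size, then the number
        # of buffers and the size of each buffer
        ct_line_buffer_size 32k;
        ct_buffers 8 4k;
    }
}
```

The pattern is quoted because it contains curly braces, which nginx would otherwise treat as block delimiters. Once the pattern has been tuned, ct_filter_logonly can be switched to off (or removed) so that matching responses are actually blocked.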

Some Data Structures and Definitions

The following shows the code snippet for some of the data structures and definitions used by the Nginx content filter module.

/* Parenthesized so the value expands safely inside larger expressions */
#define NGX_HTTP_CT_MAX_CONTENT_SZ (1024 * 1024 * 10)
#define NGX_HTTP_CT_BUF_SIZE 4096

typedef struct {
    ngx_str_t      match;
#if (NGX_PCRE)
    ngx_regex_t   *match_regex;
    int           *captures;
    ngx_int_t      ncaptures;
#endif
    unsigned int    occurence;
    unsigned int    matched;
} blk_pair_t;


typedef struct {
    ngx_array_t   *blk_pairs; /* array of blk_pair_t */
    ngx_flag_t    logonly;   /* flag to indicate logging only */
    ngx_chain_t   *in;

    /* the line input buffer before substitution */
    ngx_buf_t     *line_in;

    /* the last output buffer */
    ngx_buf_t     *out_buf;
    /* point to the last output chain's next chain */
    ngx_chain_t  **last_out;
    ngx_chain_t   *out;

    ngx_chain_t   *busy;

    /* the freed chain buffers. */
    ngx_chain_t   *free;

    ngx_int_t      bufs;

    unsigned       last;
    unsigned int    matched;
    unsigned int    logonce;
    
    /* output content size */
    off_t          contentsize;

} ngx_http_ct_ctx_t;

NGX_HTTP_CT_MAX_CONTENT_SZ defines the maximum size of the HTTP response that the filter module will process. The NGX_HTTP_CT_BUF_SIZE defines the size of buffer used by the function for sending an empty page.

blk_pair_t is a data structure that holds the compiled regular expression (match_regex) used for comparison, the threshold for the number of matches (occurence) that determines if the content is sensitive, and an integer variable (matched) that tracks the number of matches for the regular expression.

ngx_http_ct_ctx_t is the request module context. It allows the module to track and maintain state per request. The matched variable here indicates whether sensitive information has been detected. The contentsize variable is used to keep track of size of output that has been processed by the module. If contentsize exceeds NGX_HTTP_CT_MAX_CONTENT_SZ, an empty page will be sent.

The filter header function

The following shows the code snippet for ngx_http_ct_header_filter().

static ngx_int_t
ngx_http_ct_header_filter(ngx_http_request_t *r)
{

    ngx_http_ct_loc_conf_t  *slcf;


    slcf = ngx_http_get_module_loc_conf(r, ngx_http_ct_filter_module);

    if(slcf == NULL)
    {
        return ngx_http_next_header_filter(r);
    }


    if (slcf->blk_pairs == NULL
        || slcf->blk_pairs->nelts == 0
        || r->header_only
        || r->headers_out.content_type.len == 0)
    {
        return ngx_http_next_header_filter(r);
    }


    if (ngx_http_test_content_type(r, &slcf->types) == NULL) {
        return ngx_http_next_header_filter(r);
    }

    //Check for compressed content
    if(ngx_test_ct_compression(r) != 0)
    {//Compression enabled, don't filter
        ngx_log_error(NGX_LOG_WARN, r->connection->log, 0,
                     "[Content filter]: ngx_http_ct_header_filter"
                     " compression enabled skipping");
        return ngx_http_next_header_filter(r);
    }

    #if CONTF_DEBUG
        ngx_log_debug1(NGX_LOG_DEBUG_HTTP, r->connection->log, 0,
                       "[Content filter]: "
                       "http content filter header \"%V\"", &r->uri);
    #endif

    if (ngx_http_ct_init_context(r) == NGX_ERROR) {
        ngx_log_error(NGX_LOG_ERR, r->connection->log, 0,
                     "[Content filter]: ngx_http_ct_header_filter"
                     " cannot initialize request ctx");
        return NGX_ERROR;
    }

    r->filter_need_in_memory = 1;

    return ngx_http_next_header_filter(r);

}

This function handles the response headers and is called by Nginx for every response that it processes. The function checks that the module is configured and that the response has a content type. If the response contains only headers (for example, the request uses the HTTP HEAD method), it won't be processed further. The response is also checked for its content type and for compression. A response whose content type is not configured to be handled by the module will not be processed, and neither will a compressed response.

One of the differences between the original substitution filter and the code here is the use of chunked transfer encoding. The substitution filter uses chunked transfer encoding because replacements can change the content and therefore its length. It calls the nginx function ngx_http_clear_content_length() to do this.

For our module, there are no content replacements; although a blank empty page may be displayed if sensitive data is detected. The filter module leaves the original content length unchanged.

Another difference is the handling of the Last-Modified header. For performance, the module does not clear the Last-Modified header. This header is used by web caching mechanisms to determine whether fresh content needs to be fetched. Not clearing it means that pages can be served from caches. This improves performance but can sometimes lead to stale content being displayed. The caches may have to be cleared manually when this occurs.

The body filter function

The following shows the ngx_http_ct_body_filter() function.

static ngx_int_t
ngx_http_ct_body_filter(ngx_http_request_t *r, ngx_chain_t *in)
{
    ngx_int_t               rc;
    ngx_log_t               *log;
    ngx_chain_t             *cl;
    ngx_http_ct_ctx_t       *ctx;
    ngx_http_ct_loc_conf_t  *slcf;

    log = r->connection->log;

    slcf = ngx_http_get_module_loc_conf(r, ngx_http_ct_filter_module);
    if (slcf == NULL) {
        return ngx_http_next_body_filter(r, in);
    }

    ctx = ngx_http_get_module_ctx(r, ngx_http_ct_filter_module);
    if (ctx == NULL) {
        return ngx_http_next_body_filter(r, in);
    }

    #if CONTF_DEBUG
        ngx_log_debug1(NGX_LOG_DEBUG_HTTP, log, 0,
                       "[Content filter]: ngx_http_ct_body_filter"
                       " \"%V\"", &r->uri);
    #endif

    if (in == NULL && ctx->busy == NULL) {
        return ngx_http_next_body_filter(r, in);
    }

    /*
     * Maximum size exceeded: bytes processed so far, or the declared
     * Content-Length (content_length_n is -1 when unknown)
     */
    if (ctx->contentsize > NGX_HTTP_CT_MAX_CONTENT_SZ
       || r->headers_out.content_length_n > NGX_HTTP_CT_MAX_CONTENT_SZ)
    {

        ngx_log_error(NGX_LOG_ALERT, r->connection->log, 0,
                      "[Content filter]: Maximum size exceeded !");

        return ngx_http_ct_send_empty(r,ctx);
    }



    if (ngx_http_ct_body_filter_init_context(r, in) != NGX_OK) {
        goto failed;
    }

    for (cl = ctx->in; cl; cl = cl->next) {

        ctx->contentsize += ngx_buf_size(cl->buf);

        if (cl->buf->last_buf || cl->buf->last_in_chain) {
            ctx->last = 1;
        }

        /* Process each buffer for sensitive content matching */
        rc = ngx_http_ct_body_filter_process_buffer(r, cl->buf);

        if (rc == NGX_ERROR) {
            
            ngx_log_error(NGX_LOG_ERR, log, 0,
                          "[Content filter]: "
                          "ngx_http_ct_body_filter "
                          "error processing buffer "
                          "for sensitive content");
            goto failed;
        }
        

        /* Sensitive content is detected and log only disabled */
        if (ctx->matched && !ctx->logonly) {

            if (ctx->logonce == 0) {
                
                ngx_log_error(NGX_LOG_ALERT, r->connection->log, 0,
                              "[Content filter]: Alert ! "
                              "Sensitive content is detected !");
                              
                ctx->logonce = 1;
            }

            return ngx_http_ct_send_empty(r,ctx);
        }
        
        
        /* Maximum size exceeded */
        if (ctx->contentsize > NGX_HTTP_CT_MAX_CONTENT_SZ) {

            ngx_log_error(NGX_LOG_ALERT, r->connection->log, 0,
                          "[Content filter]: Maximum size exceeded !");

            return ngx_http_ct_send_empty(r,ctx);
        }
        


        if (ctx->last) {
            
            /* 
             * last buffer set the last_buf or last_in_chain flag
             * for the last output buffer 
             */
             
            if (ctx->out == NULL) {
                
                if (ngx_http_ct_get_chain_buf(r, ctx) != NGX_OK) {
                    ngx_log_error(NGX_LOG_ERR, log, 0,
                                 "[Content filter]: "
                                 "ngx_http_ct_body_filter "
                                 "cannot get buffer for out_buf");
                    return NGX_ERROR;
              }
              
            }
            

            ctx->out_buf->last_buf = (r == r->main) ? 1 : 0;
            ctx->out_buf->last_in_chain = cl->buf->last_in_chain;
            
        }


    }


    /* It doesn't output anything, return */
    if ((ctx->out == NULL) && (ctx->busy == NULL)) {
        
        ngx_log_error(NGX_LOG_WARN, r->connection->log, 0,
                     "[Content filter]: ngx_http_ct_body_filter "
                     "nothing to output");
                     
        return NGX_OK;
    }
    

    /* Sensitive content is detected */
    if (ctx->matched) {

        if (ctx->logonce == 0) {

            ngx_log_error(NGX_LOG_ALERT, r->connection->log, 0,
                          "[Content filter]: Alert ! "
                          "Sensitive content is detected !");
            ctx->logonce = 1;
        }


        if(!ctx->logonly) { 
         /* logonly is not enabled. Show empty page */
           return ngx_http_ct_send_empty(r,ctx);
        }

    }

    return ngx_http_ct_output(r, ctx, in);

failed:

    ngx_log_error(NGX_LOG_ERR, log, 0,
                  "[Content filter]: ngx_http_ct_body_filter error.");

    return NGX_ERROR;
}

The above function is called by Nginx for each chain of data available from a response body. It loops through a buffer chain containing the buffers that hold the response body. Each buffer is processed using the ngx_http_ct_body_filter_process_buffer() function. If the matches for a regular expression equal or exceed the configured threshold, the data is deemed to be sensitive and ctx->matched is set. If logonly is set to "on", the module will allow the original content to be sent to the browser and log an alert indicating that sensitive information has been detected. The default behavior is to log an alert and block the sensitive information by sending an empty page.

The total size of all the output that has been processed is tracked by ctx->contentsize variable. If this variable exceeds the maximum size limit, the module will stop processing further buffer chains. An empty page will be sent to the user.
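The outcome of the body filter for a given state can be summarized as a small decision function. The sketch below is a standalone simplification, not the module's code: ct_action_t, ct_decide() and CT_MAX_CONTENT_SZ are hypothetical names, and the real filter also performs logging and buffer handling that is omitted here.

```c
#include <stdint.h>

/* Same value as NGX_HTTP_CT_MAX_CONTENT_SZ (10 MB). */
#define CT_MAX_CONTENT_SZ (1024 * 1024 * 10)

typedef enum { CT_PASS, CT_BLOCK } ct_action_t;

/*
 * matched:     non-zero once a pattern has reached its threshold
 * logonly:     non-zero when ct_filter_logonly is on
 * contentsize: bytes of response body processed so far
 */
static ct_action_t ct_decide(unsigned matched, int logonly,
                             int64_t contentsize)
{
    if (contentsize > CT_MAX_CONTENT_SZ) {
        return CT_BLOCK;  /* too large to inspect: send an empty page */
    }

    if (matched && !logonly) {
        return CT_BLOCK;  /* sensitive content and blocking is enabled */
    }

    return CT_PASS;       /* clean, or detected in log only mode */
}
```

Note that, going by the function shown above, the size limit applies regardless of the log only setting: an oversized response results in an empty page even when ct_filter_logonly is on.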

Function that processes each buffer

The following shows the ngx_http_ct_body_filter_process_buffer() function.

static ngx_int_t
ngx_http_ct_body_filter_process_buffer(ngx_http_request_t *r, 
                                       ngx_buf_t *b)
{
    size_t               bufsz;
    u_char               *p, *last;
    ngx_int_t            rc;
    ngx_http_ct_ctx_t    *ctx;

    rc = NGX_OK;

    ctx = ngx_http_get_module_ctx(r, ngx_http_ct_filter_module);

    if (b == NULL) {
        ngx_log_error(NGX_LOG_ERR, r->connection->log, 0,
            "[Content filter]: ngx_http_ct_body_filter_process_buffer "
            " input buffer is null");
        return NGX_ERROR;
    }

    bufsz = (size_t) ngx_buf_size(b);

    p = b->pos;
    last = b->last;
    b->pos = b->last; /* buffer is consumed */

    #if CONTF_DEBUG
    
        ngx_log_debug4(NGX_LOG_DEBUG_HTTP, r->connection->log, 0,
                       "[Content filter]: processing buffer: "
                       "%p %uz, line_in buffer: %p %uz",
                       b, last - p,
                       ctx->line_in, ngx_buf_size(ctx->line_in));
    #endif


    if (bufsz != 0) {

        /* Input buffer is not zero */
        rc = ngx_http_ct_body_filter_getline_match(r, p, last, ctx);

    }
    else
    {
        /* Input buffer is zero */
        if (ctx->last) {

            #if CONTF_DEBUG
                ngx_log_debug0(NGX_LOG_DEBUG_HTTP, r->connection->log, 
                               0, "[Content filter]: "
                        "the last zero buffer, try to do substitution");
            #endif

            /* Last buffer try to do a match if line_in is not empty */
            if (ngx_buf_size(ctx->line_in)) {

                rc = ngx_http_ct_match(r, ctx);
                
                if (rc < 0) 
                {
                    ngx_log_error(NGX_LOG_ERR, r->connection->log, 0,
                                  "[Content filter]: "
                                "ngx_http_ct_body_filter_process_buffer"
                                " regex matching for line fails");
                                 
                    return NGX_ERROR;
                }

            }
        
        }


    }

    return rc;

}

The ngx_http_ct_body_filter_process_buffer() function checks the size of the buffer it is processing. If the buffer is not empty, the ngx_http_ct_body_filter_getline_match() function is called. This function finds each line of text in the buffer by looking for the linefeed (\n) character. It then calls the regular expression matching function, ngx_http_ct_match(), for each line of text.

If a buffer has zero size, ngx_http_ct_body_filter_process_buffer() checks whether it is the last buffer in the HTTP response. For the last buffer, ngx_http_ct_match() is called if there is any pending data in ctx->line_in. The variable ctx->line_in stores the line of text currently being assembled from the buffers.

It may hold data that is still waiting for a linefeed (\n) character to complete a line. When the last buffer in the HTTP response is reached, this pending data must still be matched against the regular expressions even though there is no linefeed.

Function to find each line for Regex matching

The following shows the ngx_http_ct_body_filter_getline_match() function.

static ngx_int_t
ngx_http_ct_body_filter_getline_match(ngx_http_request_t *r, u_char *p,
u_char *last, ngx_http_ct_ctx_t *ctx)
{
    u_char          *linefeed;
    ngx_int_t       len, rc;


    while (p < last) {

        linefeed = memchr(p, LF, last - p);

        #if CONTF_DEBUG
            ngx_log_debug1(NGX_LOG_DEBUG_HTTP, r->connection->log, 0, 
                           "[Content filter]: find linefeed: %p",
                           linefeed);
        #endif


        if (linefeed) {
            
            /* linefeed found */
            len = linefeed - p + 1;

            if (buffer_append_string(ctx->line_in, p, len, r->pool) 
                == NULL) 
            {
                ngx_log_error(NGX_LOG_ERR, r->connection->log, 0,
                              "[Content filter]: "
                              "ngx_http_ct_body_filter_getline_match"
                              " cannot append to string buffer");
                return NGX_ERROR;
            }

            p += len;

            rc = ngx_http_ct_match(r, ctx);
            
            if (rc < 0) 
            {
                ngx_log_error(NGX_LOG_ERR, r->connection->log, 0,
                              "[Content filter]: "
                              "ngx_http_ct_body_filter_getline_match"
                              " regex matching for line fails");
                return NGX_ERROR;
            }


        }
        else {
            
          /* no linefeed */
          if (buffer_append_string(ctx->line_in, p, last - p, r->pool)
                    == NULL) 
          {
                ngx_log_error(NGX_LOG_ERR, r->connection->log, 0,
                    "[Content filter]: "
                    "ngx_http_ct_body_filter_getline_match"
                    " cannot append to string buffer");
                    
                return NGX_ERROR;
          }

          /* Exit while loop as remaining buffer no linefeed*/
          break;

        }

    }


    if (linefeed == NULL && ctx->last) {

        /* last buffer and no linefeed */
        if (ngx_buf_size(ctx->line_in)) {

            rc = ngx_http_ct_match(r, ctx);
            
            if (rc < 0) {
                
                ngx_log_error(NGX_LOG_ERR, r->connection->log, 0,
                              "[Content filter]: "
                              "ngx_http_ct_body_filter_getline_match"
                              " regex matching for line fails");
                              
                return NGX_ERROR;
                
            }

        }

    }

    return NGX_OK;

}

The function goes through a buffer and looks for the linefeed character that indicates the end of a line. It appends the characters in the line (including the linefeed) to ctx->line_in. When a line is available, it calls the function ngx_http_ct_match() to do the matching. If no linefeed is found in the current buffer, all the content is appended to ctx->line_in, waiting for subsequent buffers which may contain linefeeds.

If no linefeed is found and it has reached the last buffer of the HTTP response, ngx_http_ct_match() is called to do a final matching.
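The way a line can span several buffers is easier to see in a standalone sketch. The code below is not the module's code: line_acc_t, emit_line(), append() and feed() are hypothetical names, the fixed LINE_CAP array replaces the growable ctx->line_in buffer, and emit_line() merely counts lines where the module would call ngx_http_ct_match().

```c
#include <stddef.h>
#include <string.h>

#define LINE_CAP 256  /* hypothetical fixed capacity for the pending line */

typedef struct {
    char   pending[LINE_CAP];  /* plays the role of ctx->line_in */
    size_t used;               /* bytes currently pending */
    int    lines_seen;         /* lines handed to the matcher so far */
    size_t last_line_len;      /* length of the most recent line */
} line_acc_t;

/* Stand-in for ngx_http_ct_match(): consume the pending line and reset. */
static void emit_line(line_acc_t *acc)
{
    acc->lines_seen++;
    acc->last_line_len = acc->used;
    acc->used = 0;
}

/* Append n bytes to the pending line; -1 if it would overflow. */
static int append(line_acc_t *acc, const char *p, size_t n)
{
    if (n > LINE_CAP - acc->used) {
        return -1;
    }
    memcpy(acc->pending + acc->used, p, n);
    acc->used += n;
    return 0;
}

/* Feed one buffer; `last` marks the final buffer of the response. */
static int feed(line_acc_t *acc, const char *p, size_t len, int last)
{
    const char *end = p + len;

    while (p < end) {
        const char *lf = memchr(p, '\n', (size_t)(end - p));
        /* Take up to and including the linefeed, or the whole remainder. */
        size_t n = lf ? (size_t)(lf - p) + 1 : (size_t)(end - p);

        if (append(acc, p, n) != 0) {
            return -1;
        }
        p += n;

        if (lf) {
            emit_line(acc);  /* complete line, linefeed included */
        }
    }

    if (last && acc->used > 0) {
        emit_line(acc);      /* trailing data without a linefeed */
    }
    return 0;
}
```

For example, feeding "ab\ncd" followed by "ef\ngh" as the last buffer yields three lines: "ab\n", then "cdef\n" assembled from both buffers, and finally "gh", which is flushed only because the response has ended.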

The regex matching function

The following shows the code for ngx_http_ct_match() function.

static ngx_int_t
ngx_http_ct_match(ngx_http_request_t *r, ngx_http_ct_ctx_t *ctx)
{

    ngx_log_t   *log;
    ngx_int_t    count, match_count;
    #if (NGX_PCRE)
    ngx_buf_t   *src;
    ngx_uint_t   i;
    blk_pair_t  *pairs, *pair;
    ngx_str_t input;
    #endif

    match_count = 0;
    count = 0;

    log = r->connection->log;

    if(ngx_buf_size(ctx->line_in) <= 0)
    {
        return match_count;
    }


    #if (NGX_PCRE)
    src = ctx->line_in;

    if (!ctx->matched) {
        /* don't run if sensitive content is already detected */

        pairs = (blk_pair_t *) ctx->blk_pairs->elts;
        for (i = 0; i < ctx->blk_pairs->nelts; i++) {

            pair = &pairs[i];
            input.data = src->pos;
            input.len = ngx_buf_size(src);

            while(input.len > 0)
            {
                /* regex matching */

                pair->ncaptures = (NGX_HTTP_MAX_CAPTURES + 1) * 3;
                pair->captures = ngx_pcalloc(r->pool, 
                                         pair->ncaptures * sizeof(int));

                count = ngx_regex_exec(pair->match_regex, &input, 
                                       pair->captures, pair->ncaptures);
                if (count >= 0) {
                    /* Regex matches */
                    match_count += count;

                    /* To track previous matches */
                    pair->matched++;

                    input.data = input.data + pair->captures[1];
                    input.len = input.len - pair->captures[1];

                    if(pair->matched >= pair->occurence)
                    {
                        ctx->matched++;
                        break;
                    }

                } else if (count == NGX_REGEX_NO_MATCHED) {
                     /* no match break out of while loop */
                     break;

                } else {

                    ngx_log_error(NGX_LOG_ERR, log, 0,  
                                  "[Content filter]: ngx_http_ct_match"
                                  " regexec failed: %i", count);
                    goto failed;
                }

            }


            if (ctx->matched) {
                break;
            }


        }
    }
    #endif


    if (ngx_http_ct_out_chain_append(r, ctx,
        ctx->line_in)!= NGX_OK) 
    {
            
        ngx_log_error(NGX_LOG_ERR, log, 0,  "[Content filter]: "
            "ngx_http_ct_match cannot append line to output buffer: %i", 
            count);
        goto failed;
    }


    ngx_buffer_init(ctx->line_in);

    #if CONTF_DEBUG
        ngx_log_debug1(NGX_LOG_DEBUG_HTTP, log, 0, "[Content filter]: "
                       "match counts: %i", match_count);
    #endif

    return match_count;

failed:

    ngx_log_error(NGX_LOG_ERR, log, 0,
                  "[Content filter]: ngx_http_ct_match error.");

    return -1;
}

The regular expression matching is done in this function. It will go through the array of blk_pair_t, the data structure holding the regular expression. For each blk_pair_t regular expression (pair->match_regex), it will match against the line of data in the ctx->line_in buffer.

If a match is found, the variable pair->matched that tracks the number of matches for a regular expression is incremented. The input line is updated to a new position that is after the matched string. The matching then continues from this new position. The process is repeated until the end of the line.

If the matched variable (pair->matched) equals or exceeds the threshold (pair->occurence) for the regular expression, a flag (ctx->matched) is set to indicate that sensitive information has been detected. No further regular expression matching is done once sensitive data is detected.
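The counting and threshold logic described above can be approximated outside Nginx with standard shell tools. The following is only a sketch and not the module's code: the helper name is_sensitive and the sample lines are hypothetical, and grep -E with [0-9] stands in for the PCRE \d class that the module accepts.

```shell
#!/bin/sh
# Sketch only: mimics the per-line "count matches, compare with threshold"
# idea using grep. is_sensitive is a hypothetical helper, not module code.

# is_sensitive PATTERN THRESHOLD LINE
# Returns 0 (true) when LINE contains at least THRESHOLD matches of PATTERN.
is_sensitive() {
    pattern=$1
    threshold=$2
    line=$3
    # grep -o prints each non-overlapping match on its own line, so the
    # number of output lines is the number of matches. tr removes the
    # padding that some wc implementations add.
    matched=$(printf '%s\n' "$line" | grep -oE -- "$pattern" | wc -l | tr -d ' ')
    [ "$matched" -ge "$threshold" ]
}

nric='S[0-9]{7}[A-Z]'

if is_sensitive "$nric" 1 'alice S1234567A alice@example.com'; then
    echo "threshold 1: flagged"
fi

if ! is_sensitive "$nric" 3 'S1234567A and S7654321B'; then
    echo "threshold 3: not flagged, only 2 matches"
fi
```

Like the module, the sketch counts matches within a single line and compares the count with the configured threshold. It does not reproduce the module's buffering or its handling of multiple regular expressions.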

We have gone through some of the key parts of the content filter source code. For the full source code, refer to the GitHub link at the end of the article.

Building the Docker Image

This section uses an Ubuntu Linux system (20.04 LTS) with Docker Community Edition installed to build the Nginx image with the content filter module. Refer to Docker Installation for information on how to install and set up Docker.

We will use a Docker multi-stage build to create the nginx content filter image. Create a working directory and change into it.

mkdir mynginx
cd mynginx

Enable Content Trust to verify the Docker base images that will be pulled from Docker Hub.

export DOCKER_CONTENT_TRUST=1

We will use Alpine Linux 3.12.1 as the base image for the nginx application. Create a Dockerfile with the following content.

#Docker Image for building
FROM alpine:3.12.1 as builder
COPY build.sh /root
RUN cd root &&\
    chmod 755 build.sh &&\
    ./build.sh


#Actual image to be created
FROM alpine:3.12.1
COPY --from=builder /usr/local/nginx /usr/local/nginx
RUN touch /usr/local/nginx/logs/access.log &&\
    touch /usr/local/nginx/logs/error.log &&\
    ln -sf /dev/stdout /usr/local/nginx/logs/access.log &&\
    ln -sf /dev/stderr /usr/local/nginx/logs/error.log &&\
    addgroup -g 8000 nginx &&\
    adduser -G nginx -u 8000 -D  -s /sbin/nologin nginx &&\
    mkdir /usr/local/nginx/tmp &&\
    chmod 1777 /usr/local/nginx/tmp

USER nginx
EXPOSE 8000/tcp

STOPSIGNAL SIGTERM

CMD ["/usr/local/nginx/sbin/nginx", "-g", "daemon off;"]

The Dockerfile is a multi-stage build. The first stage contains the instructions to create the builder image and compile nginx with the content filter module. A script, build.sh, is used to download the required sources and compile nginx. The second stage creates the nginx image using the compiled binary produced by the builder image.

The nginx application runs as a normal user instead of root. The logs are sent to stdout and stderr. A special temporary directory, /usr/local/nginx/tmp, is created so that it can be mounted using tmpfs. This allows us to run the nginx image as an immutable, read-only container. All the temporary files used by Nginx are written to /usr/local/nginx/tmp, which is a memory-based tmpfs filesystem.

Create the build.sh script with the following content.

#!/bin/sh
apk update
apk add wget gcc libc-dev make git g++ perl linux-headers gnupg
mkdir build
cd build
wget https://nginx.org/download/nginx-1.18.0.tar.gz
wget https://ftp.pcre.org/pub/pcre/pcre-8.44.tar.gz
wget https://www.zlib.net/zlib-1.2.11.tar.gz
wget https://www.openssl.org/source/openssl-1.1.1h.tar.gz
git clone https://github.com/ngchianglin/NginxContentFilter.git

nginx_sha256="4c373e7ab5bf91d34a4f11a0c9496561061ba5eee6020db272a17a7228d35f99"
pcre_sha256="aecafd4af3bd0f3935721af77b889d9024b2e01d96b58471bd91a3063fb47728"
zlib_sha256="c3e5e9fdd5004dcb542feda5ee4f0ff0744628baf8ed2dd5d66f8ca1197cb1a1"
openssl_sha256="5c9ca8774bd7b03e5784f26ae9e9e6d749c9da2438545077e6b3d755a06595d9"
content_filter_config="d20e9df127e9e3c87e175b7a2191021a9a3ffc0d94aff5e1dfbdbbaaea033074"
content_filter_module="9779f91da58bcaed9f5697103f14201c9f65746ffeeefb29cf2e34aff7420ef3"

cksum()
{
  checksum=$1
  file=$2
  # Quote the expansions so an empty or unusual value cannot break the test
  val="$(sha256sum "$file" | cut -d ' ' -f1)"

  if [ "$val" != "$checksum" ]
  then
      echo "Sha256 sum of package $file does not match !"
      exit 1
  else
      return 0
  fi
}

cksum $nginx_sha256 "nginx-1.18.0.tar.gz"
cksum $pcre_sha256 "pcre-8.44.tar.gz"
cksum $zlib_sha256 "zlib-1.2.11.tar.gz"
cksum $openssl_sha256 "openssl-1.1.1h.tar.gz"
cksum $content_filter_config "NginxContentFilter/config"
cksum $content_filter_module "NginxContentFilter/ngx_http_ct_filter_module.c"

tar -zxvf nginx-1.18.0.tar.gz
tar -zxvf pcre-8.44.tar.gz
tar -zxvf zlib-1.2.11.tar.gz
tar -zxvf openssl-1.1.1h.tar.gz


#
# Take note that Alpine Linux uses musl as the C library
# instead of glibc. At the moment musl does not support
# _FORTIFY_SOURCE, so this option has no effect.
#
cd nginx-1.18.0
./configure --with-cc-opt="-Wextra -Wformat -Wformat-security -Wformat-y2k -Werror=format-security -fPIE -O2 -D_FORTIFY_SOURCE=2 -fstack-protector-all" --with-ld-opt="-pie -Wl,-z,relro -Wl,-z,now -Wl,--strip-all" --with-http_v2_module --with-http_ssl_module --without-http_uwsgi_module --without-http_fastcgi_module   --without-http_scgi_module --without-http_empty_gif_module --with-openssl=../openssl-1.1.1h --with-openssl-opt="no-ssl2 no-ssl3 no-comp no-weak-ssl-ciphers -O2 -D_FORTIFY_SOURCE=2 -fstack-protector-all -fPIC" --with-zlib=../zlib-1.2.11 --with-zlib-opt="-O2 -D_FORTIFY_SOURCE=2 -fstack-protector-all -fPIC" --with-pcre=../pcre-8.44 --with-pcre-opt="-O2 -D_FORTIFY_SOURCE=2 -fstack-protector-all -fPIC" --with-pcre-jit --add-module=../NginxContentFilter
make
make install
cat << EOF > /usr/local/nginx/conf/nginx.conf
worker_processes  1;
events {
    worker_connections  1024;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    sendfile        on;
    tcp_nopush      on;
    tcp_nodelay     on;
    keepalive_timeout  65;

    server {
        listen       8000;
        server_name  localhost;
        charset utf-8;

        location / {
                root   html;
                index  index.html index.htm;
        }

        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }

    }

}

EOF

The builder stage runs build.sh to compile nginx from source. Notice that the script verifies each downloaded source archive, as well as the two content filter files from the git clone, against SHA-256 checksums configured in the script itself.
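The same verification can also be expressed with sha256sum's own check mode. The following is a sketch under assumptions: verify_sha256 is a hypothetical helper name that does not appear in build.sh, and it relies on the -c option, which is available in the GNU coreutils and BusyBox versions of sha256sum.

```shell
#!/bin/sh
# Sketch only: verify_sha256 is a hypothetical helper, not part of build.sh.

# verify_sha256 EXPECTED_HEX FILE
# Returns 0 when FILE's SHA-256 digest equals EXPECTED_HEX, non-zero otherwise.
verify_sha256() {
    expected=$1
    file=$2
    # sha256sum -c reads "DIGEST  FILENAME" lines (two spaces) from stdin
    # and recomputes each digest. Output is silenced; only the exit
    # status is used.
    printf '%s  %s\n' "$expected" "$file" | sha256sum -c >/dev/null 2>&1
}

# Demonstration with a throwaway file instead of a real tarball.
tmpfile=$(mktemp)
printf 'hello\n' > "$tmpfile"
good=$(sha256sum "$tmpfile" | cut -d ' ' -f1)

verify_sha256 "$good" "$tmpfile" && echo "checksum ok"

# Any change to the file makes the check fail.
printf 'tampered\n' >> "$tmpfile"
verify_sha256 "$good" "$tmpfile" || echo "checksum mismatch detected"

rm -f "$tmpfile"
```

In a build script, a failed check would normally be followed by exit 1 so that the build stops before any unverified source is compiled.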

Note that the compiler option _FORTIFY_SOURCE is not supported by musl, the C library used by Alpine Linux. This option will have no effect on the final compiled nginx binary.

Let's proceed to build the nginx docker image.

docker build -t mynginx .

A docker image with the tag mynginx will be created. This image contains Nginx compiled with the content filter module. The image comes with a default configuration for Nginx. To run the content filter, we shall use a custom configuration file.

Create an nginx.conf file inside a new directory called conf.

mkdir conf
cd conf
vim nginx.conf

Add the following to nginx.conf.

worker_processes  4;
pid        /usr/local/nginx/tmp/nginx.pid;


events {
    worker_connections  1024;
}


http {
    include       mime.types;
    default_type  application/octet-stream;

    sendfile        on;
    tcp_nopush      on;
    tcp_nodelay     on;
    keepalive_timeout  65;
    server_tokens off;
    gzip  on;

    proxy_cache_path /usr/local/nginx/tmp/cache levels=1:2 keys_zone=webcache:2m max_size=20m;
    proxy_cache_key "$scheme$request_method$host$request_uri$is_args$args";
    proxy_cache_valid 200 302 1d;
    proxy_cache_valid 404 1m;

    proxy_temp_path /usr/local/nginx/tmp/proxy_temp;
    client_body_temp_path /usr/local/nginx/tmp/client_body_temp;


    map $sent_http_content_type $cachemap {
        default    no-store;
        ~text/html  "private, max-age=900";
        text/plain  "private, max-age=900";
        text/css    "private, max-age=7776000";
        application/javascript "private, max-age=7776000";
        ~image/    "private, max-age=7776000";
    }

    server {
        listen     8000;
        server_name  localhost;
        root   /usr/local/nginx/html/;
        charset utf-8;


        location / {

            proxy_cache webcache;
            proxy_cache_bypass $http_cache_control;

            proxy_set_header Accept-Encoding "";
            proxy_set_header HOST $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_pass http://mamashop;
            add_header Cache-Control $cachemap;

          # ct_filter_types text/plain application/javascript;
          # ct_filter S\d\d\d\d\d\d\d[A-Z] 1;
          # ct_filter_logonly off;

            index  index.html index.htm;
        }

        # redirect server error pages to the static page /50x.html
        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }
    }

}

The configuration sets up Nginx as a reverse proxy for http://mamashop, where the actual web application is running. The options for the content filter are currently commented out. These can be enabled later.

Set the permissions on the nginx.conf file so that it is readable by the nginx user that runs the nginx service inside our Docker image.

chmod 644 nginx.conf

Testing the Nginx Content Filter

We will use the Vulnerable Mama Shop (VMS) application to test the Nginx Content Filter. Vulnerable Mama Shop has a SQL injection vulnerability that allows user data to be dumped out. Refer to this article Learning SQL Injection using Vulnerable Mama Shop for more information on the Vulnerable Mama Shop application.

Issue the following commands to build VMS.

git clone https://github.com/ngchianglin/VulnerableMamaShop.git
cd VulnerableMamaShop
docker build -t mamashop .

We will create a bridge network for both the nginx content filter image and the mamashop image.

docker network create --driver bridge mynet

Start up mamashop using the following command.

docker run -it --rm --disable-content-trust --name mamashop --network mynet mamashop

This starts the VMS application on the mynet network. VMS is available at http://mamashop to other Docker containers in mynet. Open another console and run the mynginx image using the following command.

docker run -it --rm --network mynet -p 8000:8000 --name mynginx -v [home dir]/conf/nginx.conf:/usr/local/nginx/conf/nginx.conf:ro --mount type=tmpfs,destination=/usr/local/nginx/tmp,tmpfs-size=52428800 --read-only mynginx

Note that you need to replace [home dir] with the full path to the directory where the conf directory was created earlier. The command mounts the custom conf/nginx.conf as a read-only file, replacing the default nginx configuration in the Docker image. It also maps port 8000 on the host to port 8000 in the nginx container. The nginx container will in turn proxy and forward traffic to the mamashop container. Notice that /usr/local/nginx/tmp is mounted as a tmpfs and the container's filesystem is set to read-only.

Visit http://[host ip]:8000 and you should be able to see the mamashop application. Play around with its features.

Fig 2. Mama Shop Application through the Nginx Reverse Proxy

Let's launch an SQL injection to dump out the user information from the vulnerable application. Configure your browser to use ZAP proxy to intercept requests sent to VMS. Refer to the article Learning SQL Injection using Vulnerable Mama Shop for more information on how to do this.

Take note that web browsers like Firefox will not proxy connections to localhost or 127.0.0.1. You therefore need to access the Vulnerable Mama Shop application through your local machine's IP address, for example http://192.168.0.25:8000/

Intercept a request to query items for a category. Modify the value of the catid parameter to the following.

catid=1000 union select firstname, nric, email from users LIMIT 7, 100

The following screenshot shows what this looks like in the ZAP intercepted request.

Fig 3. ZAP Proxy modify category id

Send the modified request to VMS. A list of users, including their email addresses and NRIC (National Registration Identity Card) numbers, will be dumped out. At this point, we have not enabled the nginx content filter yet.

Modify the nginx.conf and enable the content filter by uncommenting the following lines (remove the # in front of them).

# ct_filter_types text/plain application/javascript;
# ct_filter S\d\d\d\d\d\d\d[A-Z] 1;
# ct_filter_logonly off;

The ct_filter directive sets up a regular expression to match for NRIC numbers. It has a strict threshold of 1. This means a single match will flag the content as sensitive. ct_filter_logonly is set to off. Content that is deemed to be sensitive will be blocked and a blank page will be displayed. The ct_filter_types directive adds two other MIME types, text/plain and application/javascript. By default the filter will process text/html.

At the console where mynginx is running, press Ctrl-C to terminate the Docker container. Start it up again with the modified configuration file. Exploit the SQL injection vulnerability again and this time you should get a blank page.

Fig 5. Blank Page when content filter enabled

The nginx content filter has stopped the sensitive user list from being dumped out. If you look at the console where the Nginx docker instance is running, there should also be a message saying "Alert ! Sensitive content is detected !"

You can play around with the filter by changing some of its configuration settings, such as setting ct_filter_logonly to on, changing the regular expression, or changing the threshold to some other value. If you want to add another regular expression to match email addresses, simply add a new ct_filter directive with the relevant PCRE regular expression and threshold.

Bypassing the Content Filter

The content filter serves as an additional layer of defense against web attacks, but it is not foolproof. An attacker can try to bypass the regular expression matching. For example, in the Vulnerable Mama Shop case, we can set the catid parameter to the following.

1000 union select firstname, to_base64(nric), email from users LIMIT 7, 100

The SQL injection encodes the NRIC field into base64. This bypasses the regular expression configured for detecting NRIC numbers.

Fig 6. Bypassing the Nginx Content Filter

Notice that in the screenshot, the NRIC numbers are now all in base64 and the content filter fails to block this. The attacker can easily convert the NRIC numbers from base64 back to their original alphanumeric values using widely available tools.
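The effect can be reproduced on the command line. The following sketch is an illustration only: the sample value S1234567A is made up, matches_nric is a hypothetical helper, grep -E stands in for the PCRE matching done by the module, and the base64 tool is assumed to accept -d for decoding (as the GNU coreutils and BusyBox versions do).

```shell
#!/bin/sh
# Sketch only: shows why a pattern for the plain NRIC format does not
# match its base64 form. The sample value S1234567A is made up.

nric_pattern='S[0-9]{7}[A-Z]'

# matches_nric TEXT
# Returns 0 when TEXT contains an NRIC-like string.
matches_nric() {
    printf '%s\n' "$1" | grep -qE -- "$nric_pattern"
}

plain='S1234567A'
encoded=$(printf '%s' "$plain" | base64)

matches_nric "$plain"   && echo "plain value:  detected"
matches_nric "$encoded" || echo "base64 value: not detected ($encoded)"

# Decoding on the attacker's side recovers the original value.
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "decoded again: $decoded"
```

The encoded string no longer contains the letter, seven digits and letter sequence that the pattern looks for, so the filter has nothing to match, yet the original value is recovered with a single decode step.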

To avoid this, we can try to add a regular expression that attempts to detect base64 encoding. However, it is not easy to recognize base64 encoding with a regular expression without producing false positives. Base64 uses the same letters and digits that appear in ordinary text, so there can be a lot of false positives. Even if we can formulate a suitable regular expression, it too can be bypassed by attackers. For example, an attacker can add spaces to the formatting of the data or use a hexadecimal representation instead of base64.

Another useful technique to enhance the detection of sensitive information leakage is the use of dummy data. For example, we could insert dummy user data into the user list and set up corresponding regular expressions to detect this dummy data. There can be a regular expression to match the original dummy data as is, another to match its base64 encoded form, another to match its hexadecimal encoded form, and so on. This can help in the detection of data leakage and reduce false positives. But it too is not perfect and can be bypassed.
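As a rough illustration of the dummy data idea, the following sketch prints the encoded forms of a dummy value that filter rules could look for. It is a sketch under assumptions: canary_forms is a hypothetical helper, the sample value is made up, and od and base64 are assumed to be available on the system.

```shell
#!/bin/sh
# Sketch only: derives the literal strings that filter rules for a dummy
# ("canary") record could look for. canary_forms and the sample value
# are hypothetical.

# canary_forms VALUE
# Prints the plain, base64 and hexadecimal forms of VALUE, one per line.
canary_forms() {
    value=$1
    printf 'plain:  %s\n' "$value"
    printf 'base64: %s\n' "$(printf '%s' "$value" | base64)"
    # od prints space-separated hex bytes; tr joins them into one string.
    printf 'hex:    %s\n' "$(printf '%s' "$value" | od -An -tx1 | tr -d ' \n')"
}

canary_forms 'S1234567A'
```

Each printed form could then be configured as a pattern in its own ct_filter directive. Note that the base64 form only matches when the dummy value is encoded on its own. If the attacker encodes it together with neighbouring data, the byte alignment changes and so does the encoded text.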

The Nginx content filter module though is still useful as an additional layer of defense that can thwart simple attacks. When there is a vulnerability in an application, the best way to resolve it is fixing the bug and vulnerability directly. Additional protections such as web application firewalls (WAFs) and outgoing content monitoring can provide some mitigations. These mechanisms though can be bypassed by more advanced attackers.

Conclusion and Afterthought

The Nginx content filter module depends on PCRE for regular expression matching. A possible improvement is to use a regex engine that is stream based and non-backtracking. An example is the OpenResty sregex. The sregex library is still under heavy development and its APIs may change without notice. It may be worthwhile to look into using sregex if high performance is required.

Another high performance regular expression engine is Hyperscan, which can match multiple regular expressions simultaneously. It also makes use of modern x86 processor hardware instructions like SIMD. Some open source intrusion detection software, such as Suricata, supports Hyperscan to enable high performance scanning.

As web attacks continue to evolve, having some means to monitor and protect outgoing data can help to stop and prevent some attacks. The Nginx Content Filter module allows the inspection of the outbound response body using PCRE regular expressions. While it is not perfect, it adds to the tools that security professionals and defenders have for defeating web attacks.

Useful References

Weibin Yao's Nginx Substitution Filter, the original substitution module that the Nginx content filter in this article is forked from.
Writing an Nginx Response Body Filter Module, an article that provides information on how to develop and write a simple Nginx filter module.
Nginx Official Development Guide, the official guide to Nginx development.
Emiller’s Guide To Nginx Module Development, a very good tutorial introducing how to develop and write Nginx modules.
Learning SQL Injection using Vulnerable Mama Shop, an article about the Vulnerable Mama Shop application that is used here for testing the Nginx content filter. Vulnerable Mama Shop is a vulnerable application that can be used for learning and practising SQL injection.
Blocking Common Attacks using ModSecurity 2.5 Part 3, an article about using ModSecurity 2.5 to block attacks. It includes an example of inspecting the response body.
ModSecurity Handbook, a useful introduction to the ModSecurity application firewall.
ModSecurity CRS - Anomaly Scoring Mode, documentation on anomaly scoring using the OWASP ModSecurity Core Rule Set. Anomaly scoring can also be used on the response body.
NAXSI, Nginx Anti XSS & SQL Injection, a web application firewall module for Nginx. NAXSI is a WAF that focuses on a small set of rules for detecting web attacks.
Openresty sregex, a high performance regex engine that is non-backtracking.
Musl Libc, a small, lightweight and fast C library used by Alpine Linux.
Hyperscan, a fast regular expression engine that can match multiple regular expressions simultaneously. It also makes use of many modern CPU hardware features to optimize matching.
Regular Expression Matching Can Be Simple And Fast, Russ Cox's article about regular expression matching approaches and performance. It clearly explains the internals of how regular expressions work, covering both non-deterministic finite automata and deterministic finite automata.

The full source code for the Nginx Content Filter is available at the following GitHub link.
https://github.com/ngchianglin/NginxContentFilter

The scripts and Dockerfile for building the Nginx Content Filter Docker image are available at the following GitHub link.
https://github.com/ngchianglin/Docker-Alpine-NginxContentFilter

If you have any feedback, comments, corrections or suggestions to improve this article, you can reach me via the contact/feedback link at the bottom of the page.

Article last updated on Nov 2020.