Rspamd: detecting PESEL in PDF attachments – works in mail body but not in PDF

hello99974

Hello,

I’m having trouble configuring Rspamd in Mailcow for a simple DLP rule that detects Polish PESEL numbers (11 digits) in email bodies and attachments (especially PDF files).

Current behavior:

detection works correctly in the email body,

but it does not work for PDF attachments, even though I installed pdftotext (poppler-utils) and enabled extract_text in mime_types.conf.

My configuration:

mime_types.conf:

extract_text = true;

pdf = {
  extract_text = true;
}

rspamd.local.lua:

rspamd_config:register_symbol({
  name = 'PESEL_FOUND',
  score = 1.0,
  callback = function(task)
    local parts = task:get_text_parts()
    if not parts then return false end

    for _, part in ipairs(parts) do
      local text = tostring(part:get_content())
      -- Look for 11 consecutive digits
      if text:match("%d%d%d%d%d%d%d%d%d%d%d") then
        return true
      end
    end
    return false
  end
})

Logs show that the symbol is executed, but it only finds PESEL numbers in the email body, not in PDF attachments (the PDF contains real text, not scanned images).

Additional info:

rspamadm configtest → syntax OK

which pdftotext inside the rspamd container returns a valid path

in the Rspamd WebUI I do not see PESEL_FOUND triggered for PDFs

My questions:

Do I need to enable any additional module in Mailcow/Rspamd so that task:get_text_parts() includes extracted text from PDF attachments?

Does anyone have a working example of detecting sensitive data (PESEL / credit card numbers) inside PDF attachments?

Does Mailcow limit PDF text extraction for performance or security reasons?

I would really appreciate any hints or examples of a working configuration.

ETNyx

Not sure, but I suspect get_text_parts() is not the right call here; try get_parts() instead.

get_text_parts() - Get all text (and HTML) parts found in a message. I'm not sure this applies to you, even when the PDF is extracted as text.

On the other hand:
get_parts() - Get all mime parts found in a message. It should include all parts, hopefully the extracted PDF too.

https://docs.rspamd.com/lua/rspamd_task/
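
To see which parts each call actually returns, it may help to log the type of every MIME part first. This is only a minimal sketch: it assumes the get_parts(), get_type() and get_filename() methods from the docs linked above, describe_parts is a made-up helper name, and the mock parts at the bottom are stand-ins so the snippet also runs outside Rspamd.

```lua
-- Hypothetical helper: turn a list of MIME parts into readable lines.
-- Each part only needs get_type() (returning type, subtype) and get_filename().
local function describe_parts(parts)
  local lines = {}
  for i, part in ipairs(parts or {}) do
    local mtype, subtype = part:get_type()
    local fname = part:get_filename() or '-'
    lines[#lines + 1] = string.format('%d: %s/%s (%s)', i,
      tostring(mtype), tostring(subtype), fname)
  end
  return lines
end

-- Inside a symbol callback this could be used as follows
-- (assumption: rspamd_logger has been required):
--   for _, line in ipairs(describe_parts(task:get_parts())) do
--     rspamd_logger.infox(task, 'mime part %s', line)
--   end

-- Mock parts so the sketch runs standalone
local function mock_part(mtype, subtype, fname)
  return {
    get_type = function() return mtype, subtype end,
    get_filename = function() return fname end,
  }
end

local demo = describe_parts({
  mock_part('text', 'plain', nil),
  mock_part('application', 'pdf', 'test.pdf'),
})
for _, line in ipairs(demo) do print(line) end
```

If the PDF shows up in the get_parts() listing but not among the text parts, the symbol has to handle that part itself.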

hello99974

It didn’t work for me :[

when I upload my pdf it looks like this:
Maybe it's an encoding problem somewhere? (The PESEL number is real text inside the PDF, not an image.)

ETNyx

It seems you will need some kind of conversion. That's too complicated to work out without trying it myself, so maybe our new AI friends can help.

Use it at your OWN RISK!! I did not bother to read it.

local rspamd_logger = require "rspamd_logger"

-- Eleven consecutive digits (no checksum validation, so false positives are likely)
local PESEL_PATTERN = string.rep('%d', 11)

-- Run pdftotext on raw PDF bytes and return the extracted text, or nil on failure.
-- Note: os.execute blocks the worker while pdftotext runs.
local function pdf_to_text(task, data)
  -- os.tmpname() is standard Lua and returns a unique temporary file name
  local pdf_file = os.tmpname()
  local txt_file = pdf_file .. '.txt'
  local text
  local f = io.open(pdf_file, 'wb')
  if f then
    f:write(data)
    f:close()
    local rc = os.execute(
      string.format("pdftotext '%s' '%s' 2>/dev/null", pdf_file, txt_file))
    -- Lua 5.1/LuaJIT returns the exit status (0 on success); Lua 5.2+ returns true
    if rc == 0 or rc == true then
      local tf = io.open(txt_file, 'r')
      if tf then
        text = tf:read('*a')
        tf:close()
      end
    else
      rspamd_logger.infox(task, 'pdftotext failed for a PDF part')
    end
  end
  -- Always clean up both temporary files
  os.remove(pdf_file)
  os.remove(txt_file)
  return text
end

rspamd_config:register_symbol({
  name = 'PESEL_FOUND',
  score = 1.0,
  callback = function(task)
    -- 1. Check text parts (email body)
    for _, part in ipairs(task:get_text_parts() or {}) do
      local content = part:get_content()
      if content and tostring(content):match(PESEL_PATTERN) then
        return true
      end
    end
    -- 2. Check PDF attachments by extracting text via pdftotext
    for _, part in ipairs(task:get_parts() or {}) do
      -- get_type() returns the MIME type and subtype as two values
      local mtype, subtype = part:get_type()
      if mtype == 'application' and subtype == 'pdf' then
        -- We need the content of this part, not the whole message
        local content = part:get_content()
        if content then
          local text = pdf_to_text(task, tostring(content))
          if text and text:match(PESEL_PATTERN) then
            return true
          end
        end
      end
    end
    return false
  end
})
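
Any run of 11 or more digits (phone numbers, order IDs) matches the pattern above, so a checksum test might cut down false positives. This is a rough sketch in plain Lua, assuming the standard PESEL check digit with weights 1, 3, 7, 9, 1, 3, 7, 9, 1, 3. The names is_valid_pesel and find_pesel are made up for illustration, and the date portion of the number is not validated.

```lua
-- Weights applied to the first ten digits of a PESEL number
local WEIGHTS = {1, 3, 7, 9, 1, 3, 7, 9, 1, 3}

-- Return true if s is an 11-digit string whose last digit is a valid check digit
local function is_valid_pesel(s)
  if type(s) ~= 'string' or not s:match('^%d%d%d%d%d%d%d%d%d%d%d$') then
    return false
  end
  local sum = 0
  for i = 1, 10 do
    sum = sum + WEIGHTS[i] * tonumber(s:sub(i, i))
  end
  local check = (10 - sum % 10) % 10
  return check == tonumber(s:sub(11, 11))
end

-- Hypothetical helper: return the first checksum-valid PESEL in text, or nil.
-- Only digit runs of exactly 11 characters are considered.
local function find_pesel(text)
  for run in text:gmatch('%d+') do
    if #run == 11 and is_valid_pesel(run) then
      return run
    end
  end
  return nil
end

print(find_pesel('ID: 44051401359, phone 123456789012'))  -- sample number with a valid check digit
print(find_pesel('order 12345678901'))                    -- 11 digits, but the checksum fails
```

In the symbol callback, the text:match(...) checks could then be swapped for find_pesel(text) ~= nil.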