Hello,
I’m having trouble configuring Rspamd in Mailcow with a simple DLP rule that detects Polish PESEL numbers (11 digits) in email content and attachments (especially PDF files).
Current behavior:
detection works correctly in the email body,
but it does not work for PDF attachments, even though I installed pdftotext (poppler-utils) and enabled extract_text in mime_types.conf.
My configuration:
mime_types.conf:
extract_text = true;
pdf = {
  extract_text = true;
}
rspamd.local.lua:
rspamd_config:register_symbol({
  name = 'PESEL_FOUND',
  score = 1.0,
  callback = function(task)
    -- Only the parts that Rspamd treats as text are inspected here
    local parts = task:get_text_parts()
    if not parts then return false end
    for _, part in ipairs(parts) do
      local text = tostring(part:get_content())
      -- Look for 11 consecutive digits
      if text:match("%d%d%d%d%d%d%d%d%d%d%d") then
        return true
      end
    end
    return false
  end
})
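A side note on the pattern itself: eleven `%d` in a row also matches inside longer digit runs (phone numbers, invoice IDs), so the rule will probably produce false positives once it does see the PDF text. Below is a minimal sketch of a stricter check, assuming the standard PESEL check-digit weights (1, 3, 7, 9, 1, 3, 7, 9, 1, 3). The helper names `is_valid_pesel` and `find_pesel` are made up for illustration and are not part of Rspamd. It is plain Lua and does not touch the attachment problem.

```lua
-- Weights applied to the first ten digits by the PESEL checksum.
local WEIGHTS = {1, 3, 7, 9, 1, 3, 7, 9, 1, 3}

-- Hypothetical helper: true when `candidate` is exactly 11 digits
-- and its last digit matches the computed check digit.
local function is_valid_pesel(candidate)
  if not candidate:match("^%d%d%d%d%d%d%d%d%d%d%d$") then
    return false
  end
  local sum = 0
  for i = 1, 10 do
    sum = sum + tonumber(candidate:sub(i, i)) * WEIGHTS[i]
  end
  local expected = (10 - sum % 10) % 10
  return expected == tonumber(candidate:sub(11, 11))
end

-- Hypothetical helper: returns the first checksum-valid PESEL in `text`,
-- or nil. The frontier patterns (%f) make sure the 11 digits are not
-- part of a longer run of digits.
local function find_pesel(text)
  for candidate in text:gmatch("%f[%d]%d%d%d%d%d%d%d%d%d%d%d%f[%D]") do
    if is_valid_pesel(candidate) then
      return candidate
    end
  end
  return nil
end
```

In the callback, the `text:match(...)` test could then become `if find_pesel(text) then return true end`.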
Logs show that the symbol is executed, but it only finds PESEL numbers in the email body, not in PDF attachments (the PDF contains real text, not scanned images).
Additional info:
rspamadm configtest → syntax OK
which pdftotext inside the rspamd container returns a valid path
in the Rspamd WebUI I do not see PESEL_FOUND triggered for PDFs
My questions:
Do I need to enable any additional module in Mailcow/Rspamd so that task:get_text_parts() includes extracted text from PDF attachments?
Does anyone have a working example of detecting sensitive data (PESEL / credit card numbers) inside PDF attachments?
Does Mailcow limit PDF text extraction for performance or security reasons?
I would really appreciate any hints or examples of a working configuration.