Smartphone Hardening


A smartphone like the Samsung S4 bought only a few years ago will most probably run Android 4.4.x “KitKat” (or 5.x if upgraded), as that is as far as the stock ROM was ever updated. New devices are still sold for ~ 150€ running Android 5.x “Lollipop”, which is nearly as old. Of course, I already flashed CyanogenMod 11 back then to have more control over the device along w/ root access, which enabled me to configure netfilter and install VPN S/W.

But if you follow the Android OS version history, it becomes immediately clear that – as the ppl at LineageOS state – 7 out of 10 users run outdated operating systems on their phones. This is a matter of upgrading your device, and that is what I just did. It involved testing lots of different ROMs and Android versions, which I’m going to skip in this post.

The steps to upgrade an S4 LTE (official release date May 2013) from 4.4.x to a fairly current and rooted 8.1 “Oreo” are as follows, reduced to the minimum and excluding all the time spent on testing and research:

Step 1: Use heimdall to flash the TWRP recovery system onto the device. This can simply be done from the command line after you put the phone in Download Mode (by pressing VolumeDown+Power while turning the phone on):

sudo heimdall flash --verbose --RECOVERY recovery.img

Step 2: Use heimdall to flash an updated baseband firmware containing an updated kernel and phone/modem related firmware. I prefer the GUI for that step as it gives a far better overview of what we are doing.

This is not as hard as it looks: After you have downloaded the .tar file, extract it to a temp folder and see which files it contains. Afterwards, use heimdall to download the device’s partition information table (PIT). Next, select the PIT file, then hit the “Add” button and select each partition and its corresponding file from the folder you extracted the .tar file to, select “No Reboot” and “Resume” and finally hit the start button.

Step 3: Flash a new ROM onto the device via TWRP. Start the device by pressing VolumeUp + Home + Power to enter its recovery mode. From there, select the relevant files in the right order (and compare their checksums), which in my case were:

d3213c4895e2565ee3a7f3dd0d47aedcbe9f621eb8f89f9c51351d92573ae5dd
b5cc465abb3d9b7ad0177e74693e1bbd085775fd38808f640be537e8dcd1a3e8
e544ad0aea8702d73f2b2451e42c83cb96157881ce7879dcdea11e2bb4835718
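The checksum comparison mentioned in Step 3 can be scripted before anything is copied to the device. This is a minimal sketch, assuming GNU coreutils' sha256sum is installed and that the published hashes are SHA-256 (they are 64 hex characters long). The file name in the usage comment is a placeholder, not one of the actual images.

```shell
#!/bin/sh
# verify_sha256 FILE EXPECTED_HASH
# Returns 0 if FILE's SHA-256 digest equals EXPECTED_HASH, 1 otherwise.
verify_sha256() {
    file=$1
    expected=$2
    # sha256sum prints "<hash>  <filename>"; keep only the hash column.
    actual=$(sha256sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
        return 0
    fi
    echo "MISMATCH: $file (got $actual)" >&2
    return 1
}

# Usage (hypothetical file name, hash taken from the download page):
# verify_sha256 rom.zip "$EXPECTED_HASH" && echo "safe to flash"
```

Only flash a file after the function reports OK; a mismatch usually means a broken or tampered download.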

It appears to me that it is easily possible – even using only freely available S/W – to update all those horribly insecure smartphones out there, and it’s even far easier to achieve than back in the days. So – I ask myself – why is there no public service offered by the shop you bought your phone at that enables non-technical ppl to get this done, eradicating that bad thing called planned obsolescence?

Addendum: Upgraded from stock Android 6.0 onto LineageOS 15.1 / Android 8.1 on a SM-T585 Tablet (2016) as well (search for “sm-t585” or “gtaxllte” for relevant TWRP and LineageOS images):

sudo heimdall flash --verbose --RECOVERY recovery.img
Initialising connection...
Detecting device...
      Manufacturer: "SAMSUNG"
           Product: "Gadget Serial"
RECOVERY upload successful
Ending session...
Rebooting device...
Releasing device interface...

It is interesting to note that this time the device itself does not really get identified. Last but not least: Do not forget to create and redundantly store backups of the device(s) when finished w/ configuration et al.

Addendum 2: Doing the same for an S4 mini LTE a.k.a. GT-I9195i a.k.a. serranovelte (official release date June 2015) running stock Android 4.4.4. TWRP is already flashed; it is important to note that heimdall v1.4.2 – as for the two previous devices – has to be built from source to really work:

git clone
cd Heimdall
cmake . && make && sudo make install

Remember to install the dependencies (like libusb-dev, libqt5 etc.) mentioned in cmake warnings / errors; it then builds w/o error and flashes the device successfully. Flashing the LineageOS 15.1 image is now only a matter of copying a ZIP to SD or USB-OTG and booting the device into recovery.

Database Integrity v2

When it comes to database integrity, a two-hosts master-slave setup looks promising in theory. It surely is better than having only a single DB, but that is about the only pro-argument available.

Approach v1: Master + Slave1 

Okay, so we have a critical system running a database server w/ important production-grade data serving thousands of web clients. This qualifies as a highly critical system, and it should never be burdened w/ additional manual queries, to reduce the possibility of operational problems. So if any of the DB data is scheduled to undergo some form of analysis, that should only happen on a slave system. This might generally sound like a perfectly good setup, but only as long as the slave does not have any integrity issues. In this scenario, if the data gets inconsistent, we have to stop both master and slave (which means downtime), copy over all the data, restart master and slave, set up and reinitiate replication, and hope that the master-slave setup is now fully functional again.


Approach v2: Master + Slave 1 + Slave n 

Instead of simply deploying the previously described setup, which is still risky as the single slave is the only chance for a good backup and also forces us to take the production system offline in case of failure, we take a far better approach by deploying at least one more slave and perhaps another backup system. Now, if one slave fails or has inconsistent data, chances are still very high that another slave has good data on it. So, in this scenario, we only have to halt the good slave’s DB, copy it over onto the failed slave, optionally create a hot backup somewhere else (e.g. a simple rsync onto a NAS) in between, and all of this while the production system just keeps running w/o any interruption. Also, there is no need to configure the master-slave setup again, as the “repaired” slave will just continue where the “good” slave cleanly stopped replicating from the same master. Additionally, backups can be taken at any time w/o greater influence on any of the important assets.

I strongly recommend the latter setup, simply b/c it survived every incident there ever was, and that means it has been fully functional for many years, regardless of which one of the slaves fails from time to time.

L3 Hardening: GWx DDoS Mitigation

In the newer ages of the internet, denial-of-service attacks (DoS), their distributed variants (DDoS) and their newest reflected species (DrDoS/rDDoS) took, take and will take place increasingly often. To explain this very quickly: A Denial-of-Service (DoS) attack takes place when a single host attacks another host over the network. Distributed Denial of Service (DDoS) means that lots of often geographically dispersed aggressor hosts conduct the attack (trinoo, tfn2k and stacheldraht were famous tools for that purpose in 2001). As you can imagine, the first DDoS attacks were pretty spectacular b/c of the bandwidth achieved. Afterwards, more sophisticated Layer 7 (Application Layer) attacks were developed, then reflected attacks, and finally amplification came into play (see Wikipedia for more details).

All these attacks are not only proof of weaknesses and/or design errors in underlying internet protocols or the network service daemons implementing them. They also demonstrate their potential power, as such attacks can be an equally handy and efficient tool for governmental entities and/or their military executive branches that have an interest in e.g. wreaking havoc on a country’s essential infrastructure. On an even more sophisticated level, such attacks may be conducted as part of a larger operation w/ the intent to ultimately spoof, intercept or take over certain communications to and from target host(s) or network(s). The much hyped term “cyberwar” comes to mind, accompanied by a bitter taste of being instrumentalized by the military-industrial complex to justify questionable regulations and defense budget extensions to “make the internet a safer place”.


Basic Mitigation Theory

If we look into nature, we see that e.g. a river is able to transport certain amounts of water, but when a flood happens b/c of heavy rainfall (a.k.a. distributed denial of service taking place), the original riverbed will be too small to carry all the water which ultimately finds its own ways, forming and rearranging its surrounding landscape by whatever lies on its path. Now, if we look at that on a larger scale, a single river is most of the time only one vein of a certain area’s water transportation system, and if floods happen more often, new smaller rivers might be formed to fulfill the need for larger overall capacity. The more rivers there are, the more water can and eventually will be transported w/o the harsh effects of the previous flood. So, a more complex and dynamic river system is potentially able to fully compensate the initial problem. This split-up principle can also be applied during the mitigation of a large-scale DoS, DDoS or DrDoS/rDDoS attack, subsequently described at a basic technical level.


Technical GWx Principle

Each of the GWx systems is configured to forward and/or proxy packets for given services to the real IP of the production server. This can be achieved by implementing packet filter or routing rules on incoming layer 3 IP traffic, or by configuring a dedicated proxy / load balancer on the application layer.

If a network or host has, or itself acts as, a single gateway, it can be flooded once the amount of data reaches its maximum bandwidth capacity. So, a 1 Gbit DDoS attack will most probably fully saturate and thus take down a system connected via a single 1 Gbit link @ GW0. But if we implement a second, geographically distant GW1 w/ the same link speed and use round robin DNS to evenly spread the requests to both gateways, a 1 Gbit attack directed at the service name can no longer fully saturate the bandwidth, as each of the GWx systems will only receive its 50% share of it. A GWx cluster consisting of 3 systems reduces that to even shares of about 33% each, 4 systems to 25% each and so on: x systems = overall bandwidth/x per system. You see that this system is designed w/ scalability in mind and can be expanded in realtime just by adding more GWx to the cluster and its underlying round robin DNS configuration.
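The bandwidth/x rule can be checked w/ a few lines of shell arithmetic. This is only an illustrative sketch: it assumes a perfectly even spread across the gateways, and the function names are made up for this example.

```shell
#!/bin/sh
# share_per_gateway ATTACK_MBIT GATEWAYS
# Prints the attack bandwidth (in Mbit/s) each gateway receives when
# traffic is spread evenly, rounded up to the next integer.
share_per_gateway() {
    attack=$1
    gateways=$2
    echo $(( (attack + gateways - 1) / gateways ))
}

# is_saturated ATTACK_MBIT GATEWAYS LINK_MBIT
# Returns 0 (true) if the per-gateway share fills the link completely.
is_saturated() {
    share=$(share_per_gateway "$1" "$2")
    [ "$share" -ge "$3" ]
}

# A 1000 Mbit/s attack spread over 1 to 4 gateways.
for n in 1 2 3 4; do
    printf '%s gateway(s): %s Mbit/s each\n' "$n" "$(share_per_gateway 1000 "$n")"
done
```

W/ 1 Gbit links, only the single-gateway case counts as saturated here; every additional gateway leaves headroom for legitimate traffic.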


GWx Hardening

As each and every GWx system will be directly exposed to attack traffic, it should be hardened thoroughly on host and network level. To name only a few options: basic packet filter rules that drop known-bad or unneeded traffic, more sophisticated measures like blocking, rate-limiting or restricting the bandwidth of hosts that send too many requests in a certain timespan, or a mechanism that filters out brute-force attacks on certain services or webpages.

Extended host and network monitoring also makes sense here, but may heavily depend on your research capabilities or your intention to analyze and further develop your mitigative skillset. Security is a process, and should neither be seen as, nor advertised and marketed as a snake-oilish product.

Last but not least, it is of course crucial to keep the real IP secret and to deploy packet filtering there as well, allowing only inbound traffic from the GWx boxes to the services protected by them.
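The filtering on the real server could look roughly like the output of the following sketch. It only prints iptables commands and executes nothing. The function name is made up, the gateway addresses are placeholders from the documentation ranges, and the rules assume a plain INPUT chain w/o conflicting earlier rules.

```shell
#!/bin/sh
# print_origin_rules PORTS GW_IP...
# Prints iptables commands that accept the given TCP ports only from the
# listed gateway IPs and drop everything else aimed at those ports.
# Nothing is executed; review the output before running it as root.
print_origin_rules() {
    ports=$1
    shift
    for gw in "$@"; do
        echo "iptables -A INPUT -p tcp -s $gw -m multiport --dports $ports -j ACCEPT"
    done
    echo "iptables -A INPUT -p tcp -m multiport --dports $ports -j DROP"
}

# Placeholder gateway addresses, not real GWx systems.
print_origin_rules 80,443 192.0.2.10 198.51.100.20
```

Generating the rules from a list keeps the allowlist in one place, which helps when GWx systems are added or replaced.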


Practical Insights and Perspectives

Having dealt w/ 30+ large-scale attacks (that means at least hundreds of megabits up to a few gigabits per second) in the last two years alone, I observed that they all shared the following characteristics (setup: 4x GWx, 2 providers, 4 DCs):

  • overall attack duration mostly only a few minutes
  • usually shifted by a few minutes
  • maximum + overall attack bandwidth limited
  • attacker unable to fully disrupt GWx protected services ever since

As DDoS attacks and certain questionable mitigation techniques (as opposed to lotek, simple, functional and achievable ones) have recently also become a lucrative business model, the “customer” (or rather attacker) most probably pays for a certain package that seems to limit him to one target IP at a time and of course to a limited bandwidth. Staying rather stealthy over a longer period also seems to be a plausible requirement for the DDoS provider on the one hand as well as its “customer” on the other, so the average attack will take place mostly during high-load periods and be rather short but occur often, so that fully legitimate clients get really frustrated.

Generally speaking, and leaving aside the fact that core network providers are also able to filter e.g. using BGP, one efficient way to mitigate DoS, DDoS and DrDoS/rDDoS attacks would be to form a cyberarmy of GWx machines, geographically spread all over the world and using different providers and physical datacenters – a technique similarly deployed by the circumventive/anti-censorship Tor network. But the GWx cyberarmy – in contrast to botnets – does not have to consist of hundreds or thousands of machines; we only need high-bandwidth servers, ideally carefully chosen dedicated root servers, optionally already DDoS protected in their own network.

It could also make sense to have a variable list of GWx systems that change IPs or even providers every few months (e.g. if the monitoring shows that certain gateways are attacked more often and w/ more bandwidth). In the end, the efficiency of network offense as well as network defense heavily depends on the skillset and creativity of the red and the blue team respectively. Variability and flexibility have always been and always will be an essential part of the road to success, be it for natural species or for survival in a clearly overhyped but nonetheless unambiguously fought cyberwar. From my personal experience, and if you generally look at the successful spread of lots of things, be it historically relevant inventions or open source software, simplicity is often the key element of the resulting efficiency and widespread usage.

L7 Hardening: Anti-Bruteforce

No matter which services you run – it will not take long until somebody or something starts bruteforcing them. Instead of manually constructing a network-based mechanism, like netfilter string matching combined w/ ipt_recent, it probably makes sense to use what we already have and which does the same: fail2ban.

So, as a simple example, let’s deal w/ WordPress login bruteforcing. If we look into the server logfiles, relevant entries will contain:

"POST /wp-login.php HTTP/1.1"

So now simply extend fail2ban to include that by first creating a filter definition named wp-auth (the name referenced by the jail below) and filling it w/ the following if it fits your site’s structure:

 [Definition]
 failregex = ^<HOST> .* "POST /wp-login.php
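Before enabling the filter, it is worth checking that the pattern matches what the webserver actually logs. fail2ban ships the fail2ban-regex tool for testing against the real filter file. The sketch below is only a rough stand-in using grep, w/ the <HOST> tag replaced by a simple IPv4 pattern (an assumption, as fail2ban’s own host matching is broader) and made-up sample log lines.

```shell
#!/bin/sh
# Rough stand-in for fail2ban's <HOST> tag: IPv4 addresses only.
HOST_RE='[0-9]{1,3}(\.[0-9]{1,3}){3}'

# count_wp_login_hits: reads log lines on stdin and prints how many
# match the wp-login pattern from the filter above.
count_wp_login_hits() {
    # grep -c exits non-zero when the count is 0; "|| true" hides that.
    grep -E -c "^${HOST_RE} .* \"POST /wp-login\.php" || true
}

# Two made-up access log lines; only the first one should match.
printf '%s\n' \
    '203.0.113.5 - - [12/Oct/2017:17:39:10 +0200] "POST /wp-login.php HTTP/1.1" 200 3456' \
    '203.0.113.5 - - [12/Oct/2017:17:39:11 +0200] "GET /index.php HTTP/1.1" 200 512' |
    count_wp_login_hits
```

If the count stays at 0 for a logfile that certainly contains login attempts, the log format or the logpath of the jail is the first thing to check.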

Now just add the new jail to the (hopefully) already existing jail configuration by adding

 [wp-auth]
 enabled = true
 filter = wp-auth
 action = iptables-multiport[name=wp-auth, port="http,https"]
 logpath = /var/www/log/error.log
 bantime = 1200
 maxretry = 5

If implemented properly, we just need to restart fail2ban and it should mention the new rule by

2017-10-12 17:39:16,554 fail2ban.jail [7910]: INFO Creating new jail 'wp-auth'
2017-10-12 17:39:16,554 fail2ban.jail [7910]: INFO Jail 'wp-auth' uses pyinotify
2017-10-12 17:39:16,593 fail2ban.jail [7910]: INFO Jail 'wp-auth' started

If underlying principles are well understood, protecting other – not necessarily web-based – services should not be a hard task.

Basic IP Recon

Out of curiosity, it might be quite interesting to find out where the logged WordPress login bruteforce attacks (in my case, about 150 in only a few hours) originate from. So, let’s first write a very basic script to extract relevant data from our fail2ban.log:

#!/bin/sh
# print all ban/unban events of the wp-auth jail
grep 'WARNING \[wp-auth\]' /var/log/fail2ban.log
exit 0

This will print all the bans as well as the unbans, which take place 20 minutes later as long as the bantime of 1200 seconds is left unchanged. Now, let’s process that data a bit more to first reduce it to relevant content, eliminate double entries, and finally try to look up the IP addresses involved. A simple approach could look like:

sudo ./ | grep Ban | cut -d " " -f 7 | sort | uniq | nslookup | grep "name ="

and gives us quite some valid information. Mostly originating from .ru and .cn,  perhaps some .jp and .tr, this is quite the usual background noise, of which only 4 look a bit uncommon: name = name = name = name = no-data.
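If the log layout differs between fail2ban versions, counting fields w/ cut becomes fragile. The following sketch instead takes whatever token follows the word “Ban” and also counts how often each address was banned. The function name and the sample log lines are made up for illustration.

```shell
#!/bin/sh
# banned_ips: reads fail2ban log lines on stdin and prints "COUNT IP" for
# every banned address, most frequently banned first.
# Assumes only that the IP is the token directly after the word "Ban".
banned_ips() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "Ban") count[$(i + 1)]++ }
         END { for (ip in count) print count[ip], ip }' |
        sort -k1,1nr -k2,2
}

# Made-up sample lines; "Unban" events are ignored.
printf '%s\n' \
    '2017-10-12 17:40:01,100 fail2ban.actions: WARNING [wp-auth] Ban 203.0.113.5' \
    '2017-10-12 18:00:01,100 fail2ban.actions: WARNING [wp-auth] Unban 203.0.113.5' \
    '2017-10-12 18:05:09,200 fail2ban.actions: WARNING [wp-auth] Ban 203.0.113.5' \
    '2017-10-12 18:06:00,300 fail2ban.actions: WARNING [wp-auth] Ban 198.51.100.7' |
    banned_ips
```

The second column can be fed into nslookup or whois just like the output of the pipeline above.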

Let’s check out the WHOIS information for each of them:

inetnum: -
netname: HA-ZZ-USAT-LTD
country: CN
descr: Henan University Science And Technology Limited Company,
descr: No 7 Dongqing Road,
descr: Zhengzhou City,
descr: Henan Province.

Okay, a university network. Like back in the old days 🙂 The next one:

inetnum: -
country: CN
descr: HUANJBHJ Gov,
descr: SSDDYEBH,
descr: ZhengZhou City,
descr: Henan Provice.

Hmm, the government….and the last one

inetnum: -
netname: UNICOM-HA
country: CN
descr: China Unicom Henan province network
descr: China Unicom

is at least in my experience seen very often in any of portscan, spam, or bruteforce attacks. Now to the last highlighted one:

inetnum: -
netname: CNPC-TJ
country: CN
descr: CNPC Dagang Oilfield Communication Corporation

Never heard of something like this before – it would be interesting to know whether that is really some kind of measuring device or a “normal” PC. Since it has 1723/tcp open, I suspect the former. Also, the question always remains: Are these attacks really originating from these addresses, or are they just backdoored boxes?

If we do a quick portscan, all four IPs got one thing in common:

9999/tcp open abyss syn-ack

and a quick search reveals that it might be a remote access trojan called “The Prayer”.

To check all hosts for that backdoor, we can simply do something like

for i in `sudo ./ | grep Ban | cut -d " " -f 7 | sort | uniq ` ; do nmap -p 9999 $i --host-timeout=1 | grep open -B 3; done

DBM: Live Database Migration

Note: This howto uses percona-xtrabackup and percona-tools which can be downloaded for free. I also recommend taking some backups in between the following steps in case you are missing something. Also, closely watch error logs.

When executed correctly, the following commands should result in only ~ 1 minute downtime of the productive database.

The current setup consists of a ~ 280GB MySQL 5.5 DB including a ~ 130GB ibdata file w/o any form of replication @ our original database host db0.

What we want to have in the end is a ~ 150GB MySQL 5.7 DB @ the new production host db1 as well as a replication slave for backup and analysis @ db2.

Before we start, we implement binary logging and enable master configuration @ the currently productive db0 by setting the following, then restarting the db:


We begin w/ a live backup of the running database @ db0 onto db1:

d1g@db0:~$ time sudo innobackupex --defaults-file=my.cnf --user root --password=xxx --host= --stream=tar ./ | gzip | ssh user@db1 "cd /DATA ; tar -izx"

Afterwards, at db1, note the values from the file xtrabackup_binlog_info then apply the logs to prepare the data for usage @ db1:

d1g@db1:/DATA$ innobackupex --apply-log /DATA

Alright, we can now do the upgrade from 5.5 to 5.6:

d1g@db1:/DATA$ sudo mysql_upgrade -u root -p -h

It is time to install MySQL 5.7 and do the upgrade from 5.6 to 5.7:

d1g@db1:/DATA$ sudo mysql_upgrade -u root -p -h

Optionally and if we want to get rid of a large ibdata file, export the current users db:

d1g@db1:/DATA$ pt-show-grants -uroot -p > users.sql

Then back up the current DB and replace it w/ a fresh and empty data directory, manually create the needed DB schema and copy only the .ibd and .cfg files from the old datadir, then import the previously saved users.sql:

d1g@db1:/DATA/db# for i in ibd cfg ; do time cp -vRp /home/d1g/DBB/db/*.$i . ; done
db1|mysql> source users.sql;

After some manual fixes and import, we can now start optimizing the DB:

d1g@db1:~$ time mysqlcheck -u root -p -o db

We have the data ready and consistent for MySQL 5.7, so let’s set up db1 as master of db2 and slave of db0 and also explicitly take care that it writes the logs needed for later usage of db2 for slave replication:


It is very crucial to set the GRANT permissions right for db1 to replicate all the data that was created in the meantime on db0:

db0|mysql> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'IP_OF_db1' IDENTIFIED BY 'password';

After this is done, we tell db1 to change to the db0 as master using the correct values from the xtrabackup_binlog_info and start the slave:

db1|mysql> CHANGE MASTER TO MASTER_HOST='IP_OF_db0', MASTER_USER='repl', MASTER_PASSWORD='xxxx', MASTER_LOG_FILE='mysql-bin.000002', MASTER_LOG_POS=643929298;
db1|mysql> START SLAVE;
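Copying the log file name and position by hand is error-prone, so the statement can also be generated from xtrabackup_binlog_info. This is a minimal sketch, assuming the file’s first line holds the binlog file name and position separated by whitespace. The function name is made up, and the password stays a placeholder on purpose.

```shell
#!/bin/sh
# change_master_sql INFO_FILE MASTER_HOST REPL_USER
# Reads the binlog file name and position from the first line of
# INFO_FILE and prints the matching CHANGE MASTER TO statement.
change_master_sql() {
    info_file=$1
    master_host=$2
    repl_user=$3
    # Any further fields on the line (e.g. GTID data) end up in _rest.
    read -r log_file log_pos _rest < "$info_file"
    printf "CHANGE MASTER TO MASTER_HOST='%s', MASTER_USER='%s', MASTER_PASSWORD='xxxx', MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;\n" \
        "$master_host" "$repl_user" "$log_file" "$log_pos"
}

# Usage (the path assumes the backup was unpacked into /DATA as above):
# change_master_sql /DATA/xtrabackup_binlog_info IP_OF_db0 repl
```

The printed statement can be reviewed and then pasted into the mysql client on db1.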

Alright, db1 should be a replication slave for db0 and we have reached the first major milestone. Nice!

So now for the fun part which can be done in less than a minute: Configure db1 to be the new productive master having db2 as its replication slave, which is really no big deal: First, be sure that no new data is written to db0 by stopping the server. Then take note of the last values that db1 has received and stop the slave configuration:

db1|mysql> show master status;
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
| mysql-bin.000026 | 469687058 | | | |
1 row in set (0.00 sec)

db1|mysql> STOP SLAVE;

We need to permit db2 to replicate from db1:

db1|mysql> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'IP_OF_db2' IDENTIFIED BY 'password';

Also setup db2 to be ready for the slave setup:


Now, as done before, use the last master information from db1 on db2 so that it becomes a slave and starts replicating from these values:

db2|mysql> CHANGE MASTER TO MASTER_HOST='x.x.x.x', MASTER_USER='repl', MASTER_PASSWORD='xxxx', MASTER_LOG_FILE='mysql-bin.000026', MASTER_LOG_POS=469687058;
db2|mysql> START SLAVE;

That’s it. Now adjust DNS or whichever settings your frontend needs to use the new db1 instead of db0, and also be sure to remove the now obsolete replication settings for db0 from db1 so that it no longer tries to reach it.

L7 Hardening: Security Headers

There are quite some directives at hand that can be added to your webserver configuration to achieve hardening against many attacks. Most websites – even those that really should – do not care, and thus receive a grade F when being checked by

It is pretty straightforward to change that completely. In nginx 1.6.2, just edit the site’s config file and insert:

add_header Strict-Transport-Security "max-age=31536000";
add_header X-Frame-Options "SAMEORIGIN";
add_header X-Xss-Protection "1; mode=block";
add_header X-Content-Type-Options "nosniff";

The equivalent in nginx 1.8+ would be:

add_header Strict-Transport-Security "max-age=31536000";
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Xss-Protection "1; mode=block" always;
add_header X-Content-Type-Options "nosniff" always;
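Whether the headers really arrive at the client can be checked from the command line. The sketch below reads raw response headers on stdin and lists the ones that are missing. The function name and the URL in the usage comment are placeholders.

```shell
#!/bin/sh
# missing_security_headers: reads raw HTTP response headers on stdin
# and prints the name of every header from the list above that is absent.
missing_security_headers() {
    headers=$(cat)
    for name in Strict-Transport-Security X-Frame-Options \
                X-Xss-Protection X-Content-Type-Options; do
        # Header names are case-insensitive, hence grep -i.
        printf '%s\n' "$headers" | grep -qi "^${name}:" || echo "$name"
    done
}

# Typical use against a live site (placeholder URL):
# curl -sI https://example.org/ | missing_security_headers
```

An empty output means all four headers were found. Checking an error page as well shows whether the always parameter is taking effect.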

This already gives us a grade C, but there is another powerful mechanism: the content security policy (CSP) restricts the abilities of the browser to those predefined by you, especially by only allowing certain servers to serve certain elements of the site’s content in the first place.

So let’s take a closer look into this basic rule:

add_header Content-Security-Policy "default-src 'self'; connect-src 'self'; img-src 'self'; script-src 'self' ; style-src 'self' 'unsafe-inline' ";

This is restrictive and works only on static websites not involving any other sources for images, scripts or fonts.

As in most if not all cases when dealing w/ security, all this also involves the well known, eternal conflict: security vs. usability.

For example, webmail applications and underlying plugins often include inline JavaScript and thus need somewhat looser restrictions, expressed by

add_header Content-Security-Policy "default-src 'self'; connect-src 'self'; img-src 'self'; script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline' ";

WordPress, for example, would require a policy similar to

add_header Content-Security-Policy "default-src 'self'; connect-src 'self'; img-src 'self'; script-src 'self' 'unsafe-inline' 'unsafe-eval'; style-src 'self' 'unsafe-inline'; font-src 'self' data:";

The most efficient way to implement a valid CSP for your website and/or application is to use the debugging shortcut F12 in the browser of your choice and check the console for relevant messages while at the same time creating parameters for your CSP that fit the actual operational needs.

AVSx: Hardening the SPAM Perimeter

In the course of a recent AVSx rollout, we had the opportunity to mitigate the serious SPAM problem of a customer. This included analyzing the specific situation, considering different approaches to eliminate or at least minimize known issues and also involved a detailed measurement of the spam statistics over time, ultimately leading to a short whitepaper which is yet to be published. One important element of our first approach is to not depend too much on external (or internal) servers at runtime, so that the solution is pretty much self-contained and thus working autonomously.


Pre-Rollout Situation

The customer receives thousands of spam mails per day, many of which get forwarded to the internal mailserver. Since no catchall account is defined at that internal mailserver (the sub-contractor says that this is not possible), non-delivery reports (NDR) are sent out. And lots of them. This well-known backscatter problem has even led to the customer itself getting blacklisted in recent months. Also, there was heavy usage of black- and whitelists, which may have influenced the overall situation in a negative way as well, since that badly interferes w/ bayes filtering and the overall learning process if not handled carefully.


Improvement v1: Static LDAP

There are multiple ways to reduce the amount of bad e-mail, but the most commonly used overall practice is to make use of spamassassin and clamav. Having these tools at hand together w/ a good mailserver S/W like postfix, things can be tuned in a very efficient way.

Postfix hardening already brings quite some mechanisms. Using block- and blacklists is another approach, which should not be used right from the start, because the spamfilter might not get trained if nearly no mail reaches the server itself. Imagine a non-trained spamfilter when the blacklist is unreachable. In my eyes, it is crucial to closely survey spam occurrence and scores as a basis for statistics. These again serve as a base for the definition of the SPAM tag- and kill-levels, which should normally be set to 4.5 and 16 respectively. When analyzing the score distribution, we can identify a peak which should ideally be not too far left of the kill-level or, on a well trained system, even right of the kill-level, which means that most mails already get discarded.

But what about all the NDRs? We can query the internal LDAP server to get the information we need. So, at first, installing ldapvi makes sense. Then, if we have valid credentials, we can simply do

ldapvi -b "ou=xxx, dc=xxx, dc=xx" -h v.w.x.y -D you@yourdomain > ALLDATA.ldap.txt

and then get a list of valid e-mail addresses out of this by doing

grep "yourdomain" ALLDATA.ldap.txt | grep mail | cut -d ":" -f 2 | cut -d " " -f 2 | tee -a /etc/postfix/relay_recipients

This list of addresses is the base for the relay_recipients file, so that mail is accepted only if the corresponding addresses are also found in the LDAP directory.
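Postfix lookup tables expect a key and a value on each line. For relay_recipient_maps the value itself is not used, so a dummy such as OK is conventional. The following sketch turns the mail: lines of an LDIF-style dump into that format. The function name and the sample domain are made up, and base64-encoded values (mail::) are simply skipped.

```shell
#!/bin/sh
# ldif_to_relay_recipients DOMAIN
# Reads an LDIF-style dump on stdin and prints "address OK" for every
# mail: attribute in DOMAIN, lowercased, sorted and de-duplicated.
ldif_to_relay_recipients() {
    domain=$1
    awk -v dom="@$domain" '
        tolower($1) == "mail:" {
            addr = tolower($2)
            # keep only addresses that end in @DOMAIN
            if (length(addr) > length(dom) &&
                substr(addr, length(addr) - length(dom) + 1) == tolower(dom))
                print addr, "OK"
        }' | sort -u
}

# Usage (placeholder domain):
# ldif_to_relay_recipients yourdomain.example < ALLDATA.ldap.txt
```

Writing the result to a temporary file first and moving it into place afterwards avoids a half-written table if the dump turns out to be incomplete.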

This eliminates the problematic NDR situation previously found completely, b/c the internal mailserver normally never has to report that mail was sent to a user w/o a corresponding mail: entry in the LDAP directory dump. In a static LDAP scenario, however, mail to an account that gets removed, or mail arriving while the internal mailserver itself has problems, would still produce an NDR. One could query the LDAP DB once a month and recreate the relay_recipients file from that, but in general I guess manually changing/adding/removing users in the file makes more sense, and again, we do not want to depend too heavily on other servers.

To use the relay_recipients file, the following parameter should be set in the postfix config

d1g@isp:~$ sudo postconf relay_recipient_maps
relay_recipient_maps = hash:/etc/postfix/relay_recipients

The maintenance of the valid recipients – if needed at all – could optionally be done by the local admin if we have a (cron-)script in place that rebuilds the DB regularly by doing e.g.

postmap -v /etc/postfix/relay_recipients
postmap: name_mask: all
postmap: inet_addr_local: configured 2 IPv4 addresses
postmap: inet_addr_local: configured 2 IPv6 addresses
postmap: set_eugid: euid 1000 egid 1000
postmap: open hash relay_recipients
postmap: Compiled against Berkeley DB: 5.3.28?
postmap: Run-time linked against Berkeley DB: 5.3.28?

In this static LDAP setup, the mailserver accepts mail only for previously defined valid users, and simply by this already rejects a very high percentage of all spam mails, especially all that produced the problematic NDRs.

Improvement v2: Dynamic Lookups

While solely relying on dynamic lookups might get you into trouble as described above, having dynamic lookups only in cases where the recipient is not found statically makes sense, and also brings any change made in the LDAP directory immediately to the outside world. In order to enable the dynamic lookup feature, we have to first install postfix-ldap, and then define

relay_recipient_maps = hash:/etc/postfix/relay_recipients, ldap:/etc/postfix/

The tricky part is the file itself, as in our case it had to contain

server_host = x.x.x.x
search_base = ou=xxx, dc=xxx, dc=xx
version = 3
timeout = 10
leaf_result_attribute = mail
bind_dn = user@domain
bind_pw = userpassword
query_filter = (mail=%s) 
result_attribute = mail, addressToForward

Afterwards, restart postfix, and/or optionally test the setup by doing

postmap -vq user@domain ldap:/etc/postfix/

So, we now have both a static and dynamic mechanism in place, which makes the system rather failsafe and ready for immediate LDAP directory change propagation.

Last but not least: Keep in mind – if a valid user is listed in LDAP, but the corresponding mailbox is not available for whatever reason on the local mailserver, non-delivery reports (NDR) will be sent out!

Improvement v3: Query Proxy Addresses

In some cases, the LDAP query has to be adjusted to the given scenario:

server_host = x.x.x.x
search_base = ou=xxx, dc=xxx, dc=xx
version = 3
timeout = 10
leaf_result_attribute = mail
bind_dn = user@domain
bind_pw = userpassword
query_filter = (proxyAddresses=smtp:%s) 
result_attribute = mail, addressToForward

After a restart of postfix, the mechanism works as intended.

S/W issues: Apache 2.4/mod_fcgid

During a recent hardening/maintenance session, the httpd was upgraded to version 2.4, which initially produced lots of PHP segfaults, most probably b/c content and configs were not sufficiently adjusted by the admin(s) beforehand.

After some config + content adjustments during a 2nd maintenance session, the setup ran quite well, but during periods of high load, lots of different error messages were shown. The types that caught our attention were:

[Sun Mar 05 23:39:40.008981 2017] [fcgid:emerg] [pid 6608:tid 139850686646016] (35)Resource deadlock avoided: [client x.x.x.x:48916] mod_fcgid: can't get pipe mutex
[Sun Mar 05 23:43:43.343519 2017] [fcgid:warn] [pid 7829:tid 139850577540864] (104)Connection reset by peer: [client x.x.x.x:49693] mod_fcgid: ap_pass_brigade failed in handle_request_ipc function, referer: ....
[Mon Mar 06 00:03:53.432715 2017] [fcgid:emerg] [pid 8618:tid 139850669860608] (35)Resource deadlock avoided: [client x.x.x.x:47173] mod_fcgid: can't lock process table in pid 8618, referer: ....

Also, the server had at least 20 PHP zombie processes which cannot be killed, but pile up and exhaust resources:

1745 ?        Z      0:06 [php-cgi] <defunct> 
1753 ?        Z      0:06 [php-cgi] <defunct> 
3340 ?        Z      0:09 [php-cgi] <defunct> 
3341 ?        Z      0:09 [php-cgi] <defunct> 
3509 ?        Z      0:02 [php-cgi] <defunct>

So, it seemed to have something to do w/ the way mod_fcgid starts PHP and locks the underlying processes.
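Zombies cannot be killed b/c they are already dead; they only linger until their parent collects the exit status. To keep an eye on how many of them pile up, something like the following sketch may help. The function name count_zombies is an assumption for illustration; it expects input in the `ps ax` column layout shown above (PID TTY STAT TIME COMMAND).

```shell
# count_zombies [NAME]
# Reads `ps ax`-style lines on stdin and prints how many processes are in
# state Z (zombie). If NAME is given, only lines containing NAME are counted.
count_zombies() {
    awk -v name="${1:-}" '
        $3 ~ /^Z/ && (name == "" || index($0, name) > 0) { n++ }
        END { print n + 0 }
    '
}

# Typical use on a live system:
#   ps ax | count_zombies php-cgi
```

Run from cron or a monitoring check, a rising number points to a parent process that is not reaping its children.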

After a little research here and there, a look at how the system is configured gave us:

user@host:/opt/apache2/conf# grep -i mutex */*
extra/httpd-ssl.conf:SSLMutex "file:/opt/httpd-2.2.29/logs/ssl_mutex"
extra/httpd-ssl-domain.conf:Mutex file:/opt/apache2/logs/ssl_mutex

Hmm. One config even points to the wrong directory (the old httpd-2.2.29 tree), and both use the “file” fcntl locking mechanism, which seems to be what causes the errors in the first place.

Possible solutions would be to use

Mutex flock:${APACHE_LOCK_DIR} default

or as recommended

Mutex sem
SSLMutex sem

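instead. Note that SSLMutex is a 2.2 directive; in httpd 2.4 it was replaced by Mutex, so only the first of the two lines applies there. Also, if the “sem” config switch produces new errors, it may be b/c the kernel's semaphore arrays are limited and should be extended by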

sysctl -w kernel.sem="250 32000 32 1024"
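The four numbers in kernel.sem are easy to mix up, so here is a small sketch that labels them. The function name describe_sem is made up for illustration; the field order (SEMMSL, SEMMNS, SEMOPM, SEMMNI) is the one documented for /proc/sys/kernel/sem.

```shell
# describe_sem "SEMMSL SEMMNS SEMOPM SEMMNI"
# Hypothetical helper: takes the kernel.sem value as one string and prints
# each of the four limits with its name.
#   SEMMSL  max. semaphores per array
#   SEMMNS  max. semaphores system-wide
#   SEMOPM  max. operations per semop() call
#   SEMMNI  max. number of semaphore arrays
describe_sem() {
    # intentional word splitting: the value is separated by spaces or tabs
    set -- $1
    printf 'SEMMSL=%s\n' "$1"
    printf 'SEMMNS=%s\n' "$2"
    printf 'SEMOPM=%s\n' "$3"
    printf 'SEMMNI=%s\n' "$4"
}

# Typical use on a live system:
#   describe_sem "$(cat /proc/sys/kernel/sem)"
```

Keep in mind that a value set w/ sysctl -w is gone after the next reboot; to make it permanent, it would also have to go into the sysctl configuration of the system (e.g. /etc/sysctl.conf on many distributions).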

So, just to be on the safe side, let’s do both, wait for a good maintenance timeslot, and reload the new config via service apache reload or via init, which uses apachectl anyway:

/etc/init.d/apache reload

Great, all PHP zombie processes vanished as well:

www-data 16178 0.9 0.2 95092 21788 ? S 02:41 0:00 /opt/php5/bin/php-cgi -c /opt/apache2/conf
www-data 16505 1.8 0.2 95100 21628 ? S 02:41 0:01 /opt/php5/bin/php-cgi -c /opt/apache2/conf
www-data 16685 3.9 0.2 93856 20480 ? S 02:41 0:03 /opt/php5/bin/php-cgi -c /opt/apache2/conf
www-data 16819 3.7 0.2 95104 21448 ? S 02:41 0:03 /opt/php5/bin/php-cgi -c /opt/apache2/conf
www-data 16846 1.7 0.2 95364 21692 ? S 02:41 0:01 /opt/php5/bin/php-cgi -c /opt/apache2/conf

The rest now seems to work as well, no more error messages thrown so far! Wait – after a while, one error reoccurs, and this very last section should be pretty self-explanatory:

# should probably be set to 500
# d1g 060317
#FcgidMaxRequestsPerProcess 500
FcgidMaxRequestsPerProcess 0

Ethernet Issues v2: Retransmits

A few days before the time of this writing, certain problems arose at a high traffic (2.5 TB/day) and high load (50-75%) site.

What were the symptoms? The original settings ultimately led to TCP retransmits, which slowed down the affected users' requests by approximately 67 seconds on Mac OS X, tho only by a few seconds on Windows, which maybe made the problem so hard to spot in the first place.

However, LTE connections and home networks at other locations seemed to be unaffected by the issue, and an older Linux router, which apparently had compensated for the problem until recently, made it even harder to spot.

Normally, it is the other way around, but this time a skilled customer took a deeper look into the issue (thnx, alex!) and gave us a hint, having identified the probable involvement of the tcp_tw_recycle setting, which can be checked by e.g.

# cat /proc/sys/net/ipv4/tcp_tw_recycle

What it does:

“Enable fast recycling of sockets in TIME-WAIT status. The default value is 0 (disabled). It should not be changed without advice/request of technical experts.”

Having mentioned tcp_tw_recycle, two other relevant settings are tcp_tw_reuse and tcp_max_tw_buckets; the latter seems to be set to 262144 on newer systems and to as low as 16384, 65536 or 131072 on older ones. So yes, again there could be a connection to a high-traffic site having roughly 77k connection states for some while, correlating w/ the appearance of the symptoms.

So, just to be safe, I would recommend setting tcp_max_tw_buckets to a high value by

echo 262144 > /proc/sys/net/ipv4/tcp_max_tw_buckets

One could think that lots of TIME-WAIT states only arise if you have many clients w/ (too) idle connections coming from one NATed network, and that under normal circumstances the fast recycling may make sense. However, with the connection tracking table max. entries set to let’s say 512k, most systems perhaps never need to recycle the TIME-WAIT states.
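To judge how close a box really is to the bucket limit, the TIME-WAIT sockets can be counted and put in relation to tcp_max_tw_buckets. The following is a minimal sketch; the function names count_time_wait and tw_usage are made up for illustration, and the input is assumed to be in the format of `ss -tan`, where the state is the first column.

```shell
# count_time_wait
# Reads `ss -tan` output on stdin and prints the number of sockets
# that are in state TIME-WAIT.
count_time_wait() {
    awk '$1 == "TIME-WAIT" { n++ } END { print n + 0 }'
}

# tw_usage COUNT LIMIT
# Prints how much of LIMIT is used by COUNT, as an integer percentage
# (rounded down).
tw_usage() {
    echo $(( $1 * 100 / $2 ))
}

# Typical use on a live system:
#   n=$(ss -tan | count_time_wait)
#   tw_usage "$n" "$(cat /proc/sys/net/ipv4/tcp_max_tw_buckets)"
```

A value that approaches or exceeds 100 would suggest that the limit is too low for the traffic the system has to handle.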

In the end, let's not forget that if we find problems, some trouble may already have taken place, but finding them is also the basis for fixing them in most if not all cases.

OpenPGP Key Recreation and Revocation

Despite nearly having forgotten to blog about it, the time has come to get myself a stronger OpenPGP keypair. But what about the folks I already established a secure connection with using the old key 0x800e21f5, and what about the rest of the internet? It’s not as complicated as one might think.

Note: The following Howto is meant for more advanced users and involves the shell. Novice users should just continue to use roundcube for all their PGP related things.


1. Key Creation

Key creation is very simple if you use GnuPG on Linux:

0x220b:~$ gpg --gen-key

You can leave the default options (RSA/RSA, 4096 bit, never expires) until it comes to name, e-mail and comment, where you should fill in your personal data associated w/ the use of the key. In most cases, one e-mail address is not enough, but you can just add another one like this:

0x220b:~$ gpg --edit-key 6C71D217
gpg> showpref
[ultimate] (1). Peter Ohm (NetworkSEC/NWSEC) <>
 Cipher: AES256, AES192, AES, CAST5, 3DES
 Digest: SHA256, SHA1, SHA384, SHA512, SHA224
 Compression: ZLIB, BZIP2, ZIP, Uncompressed
 Features: MDC, Keyserver no-modify
gpg> adduid

Now enter the other e-mail addy and a relevant comment if you wish. So, we now have a fresh key – but what about the old one(s)? First, we should use them to sign the new one:

0x220b:~$ gpg --default-key 800e21f5 --sign-key 6C71D217
0x220b:~$ gpg --default-key 7BB7A759 --sign-key 6C71D217

and then finally give everybody access to our new public key by:

0x220b:~$ gpg --keyserver --send-key 6C71D217
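One thing to keep in mind: the 8-digit IDs used here are just the last 32 bits of the full key fingerprint, so before signing or trusting a key it is safer to compare the whole fingerprint (as shown by gpg --fingerprint). The following sketch only illustrates that relation; the function name short_keyid is made up, and the fingerprint in the example is a made-up placeholder, not the real fingerprint of any key mentioned in this post.

```shell
# short_keyid FINGERPRINT
# Hypothetical helper: strips spaces from a (v4) fingerprint, uppercases
# the hex digits and prints the last 8 of them, i.e. the short key ID.
short_keyid() {
    printf '%s\n' "$1" | tr -d ' ' | tr 'a-f' 'A-F' | sed 's/.*\(........\)$/\1/'
}

# Example w/ a made-up fingerprint:
#   short_keyid '0123 4567 89AB CDEF 0123  4567 89AB CDEF 6C71 D217'
```

Since many different fingerprints can end in the same 8 digits, a matching short ID alone does not prove that two keys are identical.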


2. Key Revocation

Okay, now everybody must be able to know that the old keys are not used any longer. This can easily be achieved by first creating a revocation certificate for each of them, then importing that into your own keyring and finally exporting the revoked keys to the internet. Let's do it w/ a small shell script and gpg2:

for i in 7bb7a759 800e21f5; do
 # one certificate file per key, so the second run does not overwrite the first
 gpg2 --output revoke-$i.asc --gen-revoke $i
 gpg2 --import revoke-$i.asc
 gpg2 --keyserver --send-keys $i
done


3. More Key Distribution

I also recommend sending your new public key to everybody you already set up an encrypted communications channel with, as they will be the only ones possibly using the old key material (most OpenPGP clients refuse to use revoked keys for encryption) and as it’s especially them who need to be informed about any changes.

So, even if anybody interested in establishing a secure communications channel did not yet get your new public key, all that remains to be done is:

gpg --keyserver --recv-keys 6C71D217
gpg --keyserver --refresh-keys

…and don’t forget to attach your own public key if it’s a first-time contact.