Category Archives: SPLAT

SPLAT ideas

SSH to gateway cluster hangs – Finally fixed!

Oh this is most bizarre.

All my CheckPoint life I've noticed that when you ssh to the standby member it hangs for 30 seconds. I actually figured out long ago that it was a DNS problem. Member B (the standby) was sending reverse DNS queries, and the source of each query was getting translated to the cluster IP address. When the response came back, it landed on the active member (NOT the standby member), which dropped it because the standby sent the query out, not the active.
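Here is a toy model of that failure, just to make the asymmetry concrete. It is a sketch, not how the firewall kernel works: the IP addresses, the `Member` class and both functions are made up for illustration, and it assumes the active member knows nothing about the standby's own outgoing queries.

```python
from dataclasses import dataclass, field

# Hypothetical addresses, chosen only for this illustration.
CLUSTER_IP = "192.0.2.1"
MEMBER_A = "192.0.2.2"   # active member
MEMBER_B = "192.0.2.3"   # standby member
DNS_SERVER = "198.51.100.53"


@dataclass
class Member:
    ip: str
    active: bool
    # Connections this member itself opened: (source IP, server IP, query id)
    connections: set = field(default_factory=set)


def send_query(sender, query_id, hide_behind_cluster):
    """Send a DNS query and return the source IP the server will see."""
    src = CLUSTER_IP if hide_behind_cluster else sender.ip
    sender.connections.add((src, DNS_SERVER, query_id))
    return src


def deliver_reply(members, reply_dst, query_id):
    """Route the reply and report (receiving member IP, accepted?)."""
    if reply_dst == CLUSTER_IP:
        # The cluster IP is answered by whichever member is active.
        receiver = next(m for m in members if m.active)
    else:
        receiver = next(m for m in members if m.ip == reply_dst)
    accepted = (reply_dst, DNS_SERVER, query_id) in receiver.connections
    return receiver.ip, accepted


if __name__ == "__main__":
    a = Member(MEMBER_A, active=True)
    b = Member(MEMBER_B, active=False)

    # Hide behind cluster IP: the reply lands on A, which never sent the query.
    src = send_query(b, query_id=1, hide_behind_cluster=True)
    print(deliver_reply([a, b], src, query_id=1))

    # Hide disabled: the reply goes straight back to B and is accepted.
    src = send_query(b, query_id=2, hide_behind_cluster=False)
    print(deliver_reply([a, b], src, query_id=2))
```

With hiding on, the reply reaches the active member and is rejected; with hiding off, it returns to the standby that asked.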

I’ve been tooo lazy to fix every firewall with a NAT rule. But someone showed me this cool but bizarre trick.

  1. In your cluster configuration for clusterXL, select VRRP instead of clusterXL.
  2. Uncheck/Clear the Hide Cluster members outgoing ……..
  3. Set VRRP BACK!!! to clusterXL
  4. Push policy

 

DNS hide behind cluster IP

 

Waaaaalllaaaa! DNS and ssh now work.

Just sniff DNS traffic on both members to verify. NOTE: tcpdump shows the wrong source IP on the queries going OUT, but the replies make sense.

Cool huh??

Make sure this doesn’t screw with your OSPF/routed or other gateway initiated traffic because remember all gateway initiated traffic is now from the member IP and not the cluster IP.

NAT away!!!!!

dreez

 

GAIA and NTP vs SPLAT

NTP has changed on GAIA. SPLAT had its own convoluted cpd_sched daemon to schedule NTP polls. GAIA is using the normal Linux NTPD daemon to do its work. Thank you!

If you need to dink around with NTP, you use the standard Unix commands:

ntpdate -u <IP address> # force a sync
ntpq -pn # lists the peers with their poll intervals in seconds; if reach, delay, offset and jitter are all '0' you are not syncing
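If you want to script that "all zeros" check, here is a minimal Python sketch. It assumes the usual ten-column `ntpq -p` layout (remote, refid, st, t, when, poll, reach, delay, offset, jitter). The function names and both sample outputs are made up for illustration, not captured from a real gateway.

```python
TALLY_CHARS = "*+-#ox."  # status characters ntpq may print in front of a peer


def parse_peers(ntpq_output):
    """Parse the peer rows of `ntpq -pn` text into a list of dicts."""
    peers = []
    for line in ntpq_output.splitlines():
        fields = line.split()
        # Peer rows have 10 columns; skip the header and the ===== separator.
        if len(fields) != 10 or fields[0] == "remote":
            continue
        name = fields[0]
        tally = name[0] if name[0] in TALLY_CHARS else ""
        peers.append({
            "remote": name[len(tally):],
            "selected": tally == "*",     # '*' marks the current sync source
            "poll": fields[5],            # poll interval as printed
            "reach": int(fields[6], 8),   # reach register is printed in octal
        })
    return peers


def is_syncing(ntpq_output):
    """True if at least one peer has answered recently (reach != 0)."""
    return any(p["reach"] != 0 for p in parse_peers(ntpq_output))


# Made-up sample outputs for illustration.
DEAD = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 172.17.1.2      .INIT.          16 u    -   64    0    0.000    0.000   0.000
"""

GOOD = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*172.17.1.2      .GPS.            1 u   33   64  377    0.512    0.104   0.050
"""

if __name__ == "__main__":
    print("dead sample syncing:", is_syncing(DEAD))
    print("good sample syncing:", is_syncing(GOOD))
```

On a live box you would feed it the text captured from `ntpq -pn` instead of the samples.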

NOTE that if you dink with /etc/ntp.conf, you better write protect it so GAIA does not overwrite it when it boots.

Now to reset the ntp poll times, you can still use

# set to 1 hour (3600 seconds)

ntp -n 3600 172.17.1.2

and a file will be created in /etc/sysconfig/ntp that contains this config info. This file survives reboots and is used to reset the NTPd parameters. The command also enters the ‘ntpdate’ command into the cpd_sched_config list so that it runs every 3600 seconds and polls for NTP time.

Debugging Confwiz

I started using confwiz instead of the usual migration tools for migrating domains. Works pretty well. The only catch: when something goes wrong, it is not obvious where it blew up.

This was the magic I used to debug confwiz:

alias df='fwm debug fwm off TDERROR_ALL_ALL=5'
alias do='fwm debug fwm on TDERROR_ALL_ALL=5'
mdsenv <cma>
mcd
cd log

do

<import the database>

fgrep 'E R R O R' fwm.elg.*

<look for missing variable>

<create missing variable in smartdashboard>
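If a bare fgrep hit does not tell you enough, a few lines of context around each error usually help. Here is a rough Python sketch of the same search. The function names are mine, and the default `fwm.elg*` pattern is an assumption that you also want the current log, not just the rotated copies.

```python
from pathlib import Path

MARKER = "E R R O R"  # the spaced-out marker searched for in fwm.elg


def find_errors(text, context=2):
    """Return (line_number, surrounding_lines) for every line with MARKER."""
    lines = text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if MARKER in line:
            start = max(0, i - context)
            hits.append((i + 1, lines[start:i + context + 1]))
    return hits


def scan_logs(log_dir=".", pattern="fwm.elg*", context=2):
    """Run find_errors over every matching log file in log_dir."""
    results = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        hits = find_errors(path.read_text(errors="replace"), context)
        if hits:
            results[path.name] = hits
    return results


if __name__ == "__main__":
    for name, hits in scan_logs().items():
        for line_no, block in hits:
            print(f"--- {name}:{line_no}")
            print("\n".join(block))
```

Run it from the CMA log directory after the import; the lines just before each hit are often where the missing variable is named.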

Penalty box for DDOS attacks

Saw this today. I think Watchguard has had this for years. About time.

Too bad it's command line only for now.

http://blog.lachmann.org/?p=1723

Performance Boosts

I learned this at CPX.

So you have 1 gateway that is a total dog and you are looking at GAIA and that sleek 64-bit kernel handling 50 gajillion megawatts of power. Drool Drool Drool.

But now you have to upgrade. Oh geez, now I have to upgrade my management. Oh geez, new OS, so now I have to train people. UGH!!!! And then it blows up on the launch pad. Oh Oh.

Here are some tips from the backroom.

1) Look at your high-performance, INSPECT-intensive traffic: HTTP? H323? Well, every time one of those packets hits a rule with an “Any” service, the kernel goes through the WHOLE list of INSPECT services to see if it should INSPECT it to death. This takes time, memory, etc.

Instead, create a special rule for that service AND, on the service's “Advanced” tab, clear the “Match for Any” option so the service is removed from “Any” matching.

Whaaala: Our client went from 1G/sec throughput to 9.5G/sec throughput

2) NOTE: Global-Properties->Stateful Inspection->Timeouts: Crank these down to create more space in the connection table. Note that the specific SERVICE timeouts override the Global Properties ones.

Whaaaalaaa: You can put off GAIA a couple more days.
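To see why the timeouts matter, here is a back-of-the-envelope model in Python. It is only a steady-state estimate (entries in the table are roughly arrival rate times how long each entry lives), and every number and name in it is hypothetical, not measured on a real gateway.

```python
def estimated_entries(new_conns_per_sec, avg_active_secs,
                      lingering_fraction, idle_timeout_secs):
    """Rough steady-state size of the connection table.

    lingering_fraction is the share of connections that never close
    cleanly and so sit in the table until the idle timeout expires.
    """
    clean = new_conns_per_sec * (1 - lingering_fraction) * avg_active_secs
    lingering = (new_conns_per_sec * lingering_fraction
                 * (avg_active_secs + idle_timeout_secs))
    return clean + lingering


if __name__ == "__main__":
    # Hypothetical load: 500 new connections/sec, 10 sec each, 2% never close.
    for timeout in (3600, 600):
        entries = estimated_entries(500, 10, 0.02, timeout)
        print(f"idle timeout {timeout:>4}s -> about {entries:,.0f} entries")
```

In this made-up example the small share of lingering connections dominates the table, which is why cranking the timeout down frees so much space.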

dreez

The Death of the Containers

I’m declaring Jan 24, 2012 an official MDM mourning day. I found out today that it is official: they killed MDM containers in R75. So sites that have more than 250 domains….good luck. I am trying to figure out why, and what Enterprise customers are supposed to do. Scalability is the beauty of MDM.

I hear that you can support 500 on 1 R75.20 MDM, but I see sites die at 250. Memory pig. I’m guessing we will have to wait for 64-bit Gaia to put larger MDM environments on top of it.

More gossip just in. I hear 250 is the limit and MDS starts creaking at 100 domains. Depends on size of object database. (Guess you need Tufin to clean up those databases!!!).

Basics for any SPLAT debug geek

I thought this was a great debug cheat sheet. I’m going to be taking the Advanced Training Jan 9th-Jan 11th in Austin and I’m sure I’ll learn this all in detail and of course will share.

 

http://blog.lachmann.org/wp-content/uploads/2010/09/2010-CPUG-CON-Tobias-Lachmann-Check-Point-Troubleshooting.pdf

 

Red Alert: R75 Snapshots are broken

posted Nov 15, 2011 7:29 AM by Michael Endrizzi

Just found out from 2 customers that R75 snapshots are broken..kinda. If you are upgrading from R75 to R75.20 and it fails (and it will), you can’t just interrupt the boot and revert the snapshot. For some reason you have to rebuild the box as R75 with any hotfixes and THEN do a revert. UGH.
NOTE: When I was at CP support in Irving one advanced support person was trying to fix a huge bank that had corrupt backups/snapshots. They couldn’t go forward or backwards.
NOTE: I took apart the snapshot .tgz and it had every file on the SPLAT disk. I really don’t know why they can’t restore the snapshot…..
Read my post below on verifying restores and pulling RAID drives.

Upgrades/Migrations

posted Oct 14, 2011 8:05 PM by Michael Endrizzi

 

There is a reason I am losing my hair….upgrades and migrations…
Many people think these are as easy as putting in the CD and typing “patch add cd”…play with your iPhone for an hour… reboot and go to the bar and celebrate.
(Great job by Pat Waters, Thanks Pat)
(notice he does a snapshot/revert!!! Wise Wise man)
Did you ever watch those house remodeling shows that make it all look soooo easy? Same here, get your Home Depot credit card ready…
New installs are easy. People don’t have Internet, then after you install they have Internet and you are the hero. Not so with upgrades. People HAVE Internet access, and if the upgrade goes south people DON’T HAVE Internet access and you are the a$$ that destroyed their business (they obviously were in bed at 3am while you were working).
Upgrades are easy until they go south, then upgrades are really nasty. People assume the upgrade scripts are magical and will catch every weird configuration issue. NOT!!!! And of course you are upgrading in a live environment with a 2.5 hour window. Of course the perfect storm hits when you can’t go forward and you can’t go backward.
I’ve learned my lesson. I never approach upgrades with ease. The new Sidewinder firewall has this cool feature where it builds a virtual disk with the old environment. If anything goes wrong, just point to the virtual disk and wham bam you are back online.
So too I’ve learned my lessons with 18 hour fingernail pulling upgrades. Here are a couple options to keep hair on your head:
1) Have a duplicate hardware platform that you build on and swap it in until you are convinced it works. If not, then quickly swap in the old platform.
2) Snapshot/restore: Anyone not doing this should be fired. Also they should be castigated if they are not testing the restore to ensure the integrity of the snapshot. WHY?? Because something is not working/different in R75 where you can’t restore to a previous version! How would you like to run into that during an aborted upgrade.
3) My favorite is RAID 10, if possible: pull out the redundant drives and LABEL them “Version XXX DO NOT DELETE”. During the upgrade, if anything goes wrong, just slam in the backup drives, MAYBE do a SIC, and wham bam thank you maam. Yeah, the RAID rebuild takes a bit, but you are probably doing this at 3am anyways.
4) No better reason for HA than now. Synch to the Secondary, Make the Secondary the Primary. Upgrade the old primary. If something goes wrong who cares?
5) Verification Test: Agree with the main application owners what the verification test is. If you can’t run this test 1/2 way through the maintenance window, EJECT, PUT DOWN THE SHOVEL, STEP BACK FROM THE HOLE, PULL OUT OF THE DIVE, RETREAT, SUCK UP YOUR PRIDE AND AVOID THE WATERLOO. If this test does run successfully, then if anything else goes wrong tell them to accompany you at 3am to fix it.
You are welcome….I just saved your career. I know these have saved mine…
In addition, I urge customers to consider a migration instead of an upgrade. It’s like upgrading XP to Windows 7…do you really want to pull all the viruses and 10 years of crap you downloaded into the future? NO! I urge people to build the newer version from scratch and then re-import the ruleset on a fresh install.
My gunslinger days are over. Besides having a Plan B, I document the process so others can replicate. As I learn gotchas OR new techniques they all get written down in the process document.
But in the end it’s too late. Most of my hair is gone already. Why didn’t I learn earlier????
Damn.

Smart Splat

posted Oct 11, 2011 7:13 AM by Michael Endrizzi

Kinda cool terminal into SPLAT