Effectively, Google services here handle all the legacy cruft of dealing with the external world for us. These services are typically harder to secure, though they are more familiar to average users.
## Venmo: Payment and PCI compliance
[PCI compliance](https://www.pcisecuritystandards.org/pci_security/completing_self_assessment) is a necessary part of doing business within the US. This is presently more impactful for our [martial arts](/martialarts) division than the tech one, but it's still necessary to support. Because we host links to PCI sites, we have to review a self-assessment annually, but our obligations are limited. It would be possible for us to develop a complete payment portal against a banking institution ourselves, but because we are not a bank, we'd still be dependent on that bank's cloud services and APIs. Such development would also make us liable for more expenses, such as hiring a PCI auditor and other overhead we simply cannot afford. As such, we offload our payment system by linking out to [Venmo][venmo], which directs payment into our bank.
We are investigating using a USD Coin (USDC) wallet to offer operating on the blockchain, but that is still a weird middle ground between self-hosting and cloud at the same time, being a peer-to-peer protocol. One could argue that running a node for that protocol would make it somewhat self-hosted and that we are simply participating in the protocol with a much wider audience, in the same way that providing an RSS feed puts us in the conglomeration of information provided by RSS. However, adoption for this is still low, and more traditional banking will likely dominate any business ventures in the near future.
Operational concerns are as important as development or security ones. Having a good layering of operational continuity will allow services to continue while the network operators follow the [incident response guidelines](./Incident_Response.md). Users should understand the posture outlined here to know how an incident will affect them.
# Self-recovering Services
Ideally, services should be designed & developed to recover themselves; this is the second-best option after not having problems in the first place. Tools like systemd can restart failed services, and monitoring can identify & remediate issues. This kind of automation can sort out issues before users or admins ever notice them.
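As a minimal sketch of the systemd approach (the unit name, path, and retention numbers here are illustrative, not a real AniNIX unit), a drop-in file can tell systemd to restart a crashed service on its own:

```ini
# Illustrative drop-in: /etc/systemd/system/example.service.d/restart.conf
[Unit]
StartLimitIntervalSec=300   # stop retrying after...
StartLimitBurst=5           # ...5 failures within 5 minutes

[Service]
Restart=on-failure          # restart on any unclean exit
RestartSec=5s               # pause between restart attempts
```

After writing the drop-in, `systemctl daemon-reload` picks it up. Capping the restarts keeps a crash-looping service from masking a real problem that monitoring should surface instead.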
# High Availability & Geodiversity
High availability keeps an inconsistently failing node from taking the service down with it. If one node fails, traffic gets routed to the next one, so users don't see an outage, and admins can get the notification and sort the problem out in the background. Tools for this can live in webservers themselves or in appliances like F5 load balancers.
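For illustration (pool name, addresses, and thresholds below are made up), the webserver route can be as simple as an nginx upstream block that skips unhealthy backends:

```nginx
# Illustrative nginx upstream: two app nodes behind one service name.
upstream app_pool {
    server 10.0.0.11:8080 max_fails=3 fail_timeout=30s;  # node A
    server 10.0.0.12:8080 max_fails=3 fail_timeout=30s;  # node B
}

server {
    listen 80;
    server_name app.example.net;
    location / {
        proxy_pass http://app_pool;  # failed nodes are skipped automatically
    }
}
```

A node that fails three times in thirty seconds is pulled from rotation for the `fail_timeout` window, which is usually enough for users to never notice an intermittent crash.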
Geodiversity provides resilience against environmental issues. It requires tools like round-robin DNS or eBGP to advertise the fallback sites, but if an ISP suffers a line cut or a site endures a natural disaster (or planned maintenance), traffic will fail over to the next site. Sizing the sites is a cost decision:

* If any single site can handle peak load, users never see degradation, but the organization is paying for compute & power that does no work during normal operation.
* If any single site can handle median load, peak gets spread across both nodes, and some cost is saved during normal operation.
* If both sites are needed together to handle peak activity, users will see a service degradation during an event, but this is the most fiscally conservative option.

Don't design services to only handle median load.
This option is not currently available to us, as we don't have a second site for peering.
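For reference, the round-robin DNS piece needs nothing more than multiple A records on the same name (zone name, TTL, and addresses below are illustrative):

```
; Illustrative zone fragment: two sites answering for one name.
; Resolvers rotate through the answers, and clients that honor
; multiple A records will try the next address if one site is dark.
www.example.net.   300   IN   A   198.51.100.10   ; site A
www.example.net.   300   IN   A   203.0.113.10    ; site B
```

The short TTL matters: clients re-resolve within minutes, so removing a dead site's record takes effect quickly.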
# Disaster Recovery
Disaster recovery is responding to severe issues that can't be caught by the prior two layers. Options like Infrastructure-as-Code, backups, and AniNIX/Aether provide various ways to rebuild services during an event. DR procedures are critical for recovering from ransomware.
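As a hedged sketch of the backup half of DR (the function, paths, and 7-archive retention are assumptions for illustration, not AniNIX/Aether's actual tooling), a timestamped tar archive with simple pruning looks like:

```shell
#!/bin/sh
# backup_dir SRC DEST: archive SRC into DEST with a timestamp, keep newest 7.
# Paths and the retention count are illustrative assumptions.
backup_dir() {
    src="$1"
    dest="$2"
    mkdir -p "$dest"
    stamp="$(date +%Y%m%d%H%M%S)"
    # -C keeps archive paths relative, so restores land wherever you extract
    tar -czf "$dest/$(basename "$src")-$stamp.tar.gz" \
        -C "$(dirname "$src")" "$(basename "$src")"
    # prune: list newest-first, drop everything past the 7th
    ls -1t "$dest"/*.tar.gz | tail -n +8 | xargs -r rm -f
}
```

Restoring is the matching `tar -xzf`; the real discipline is test-restoring regularly, since an unverified backup is only a hope.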
# Business Continuity
Business continuity operation is perhaps the most critical to AniNIX operations, since it allows the best options when issues take long enough to resolve that users will notice. AniNIX/Yggdrasil, AniNIX/Foundation, and AniNIX/Singularity offer offline options that let users keep using content while the services aren't available. Other services, like AniNIX/WolfPack or AniNIX/Maat, are conveniences; if they aren't available, users have the option to wait before using them. Discord currently provides our fallback for IRC.
Core business continuity procedures:
* Maintain local clones of any AniNIX/Foundation projects you're working on.
* Use the "Download Media" option in the Emby web interface to keep local copies of AniNIX/Yggdrasil content.
* AniNIX/Singularity's TT-RSS mobile app has a "work offline" feature -- this will let the user look through the last set of articles the app downloaded.
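The first procedure works because every git clone carries full history; a sketch (the helper name and example URL are illustrative):

```shell
#!/bin/sh
# offline_clone URL DIR: take a complete local copy of a repository.
# A clone includes full history, so log/diff/commit all work with no
# server; queued commits get pushed when Foundation is reachable again.
offline_clone() {
    git clone --quiet "$1" "$2"
}

# Example invocation (URL and target path are illustrative):
# offline_clone https://aninix.net/AniNIX/Wiki.git ~/src/Wiki
```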
# Required Follow-ups
See the TOGAF, COBIT, and ITIL standards for incident-response design methods. Also available is documentation from [NIST](https://duckduckgo.com/?q=NIST+Creating+security+plans+for+federal+information+systems&ia=web) on how to formulate security plans.
## Monitoring and Alerting
Network operators should follow the `#sharingan` IRC & Discord channels for a redundant method of being alerted when there are issues. While we have no mean-time-to-acknowledge (MTTA) or mean-time-to-recover (MTTR) service-level agreements (SLA), on-call operators should attempt to respond with all available expediency to issues.
## OSINT feed
Significant non-disruptive incidents detected by [AniNIX/Sharingan](https://sharingan.aninix.net) will be recorded as part of our [OSINT feed](https://aninix.net/AniNIX/Wiki/raw/branch/main/rss/osint.xml). This feed is intended to be a public service to help improve the general community. Those watching this feed are encouraged to examine their own incoming traffic for the adversaries listed and take appropriate protective action.
We will have an extended outage 2023-10-24 0700 US Central until late in the evening, as our primary site is undergoing construction. Please watch #tech on Discord for tracking service recovery. During this time, please fall back on [continuity procedures](https://aninix.net/AniNIX/Wiki/src/branch/main/Operation/Continuity.md#business-continuity) to keep access to services provided by the AniNIX.