Base64 in python – Mr. Kelsey's Site

Python already includes a base64 library as part of the standard library. So why write our own?

I can offer two reasons. Writing your own will solidify your understanding of the base64 algorithm. Also, if you write your own, you can add in extra functionality that might not exist in the standard library version.

Believe it or not, this post is actually part of the RSA series. As per the requirements found in RFC 1421, PEM-encoded keys need to be base64 encoded. More accurately, their ASN.1 DER-encoded objects need to be base64 encoded. I’ll make a post about ASN.1 DER encoding at some point in the future.

For now, let’s look at base64 encoding. It’s a pretty simple system and offers the ability to turn any data into printable data. This is to say that, no matter what input you use, the encoded output will consist solely of the following 65 printable characters:

'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P',
'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f',
'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
'w', 'x', 'y', 'z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', '/',
'='

Some of you may be wondering why it is called base64 instead of base65 seeing as how there are a total of 65 possible characters. This is because the ‘=’ character is actually just a padding character and will only ever show up at the end of a base64 string. The code should help you understand how the padding is determined.
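
To see the padding rule in action before writing any code, here is a small sketch that leans on the standard library's base64 module. It encodes one, two, and three bytes of input; the variable names are only for illustration.

```python
import base64

# One, two, and three input bytes end in two, one, and zero '=' characters.
one_byte = base64.b64encode(b"M")        # b"TQ=="
two_bytes = base64.b64encode(b"Ma")      # b"TWE="
three_bytes = base64.b64encode(b"Man")   # b"TWFu"

for encoded in (one_byte, two_bytes, three_bytes):
    print(encoded, "padding characters:", encoded.count(b"="))
```

Notice that the '=' only ever appears at the very end, and never more than twice.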

Once the original data is encoded, it can be transmitted, saved, or otherwise handled without fear that other programs or protocols will accidentally interpret the string as anything other than a string. The encoded character string can then be decoded back to the original data and used from there.

Let’s look at how we can code this out for ourselves.

I’ll start with a focus on transforming ASCII strings into base64 and back. Afterwards, I’ll add in some functionality to use hex strings instead of ASCII.

class B64:

    def __init__(self, user_input):
        self.ascii_printable = [
        ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/',
        '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?',
        '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O',
        'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^','_',
        '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
        'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~',
        ]
    
        self.base64_printable = [
        'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P',
        'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f',
        'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
        'w', 'x', 'y', 'z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', '/',
        ]
        self.user_input = user_input

    def encode(self):
        pass
    
    def decode(self):
        pass

base64_printable holds all the printable characters used to base64 encode data. ascii_printable holds all the symbols you’d find on a standard American English keyboard.

Before I move on, it’s worth taking a look at how ASCII looks to a computer. Remember, computers have no sense of anything beyond binary. ASCII characters are just 8 bit sequences (strictly, 7-bit codes stored in a byte with a leading zero) mapped to the symbols we see on a keyboard. You can easily see this mapping by querying your favorite search engine for ‘ascii table’.

If you use Linux, you can run the command ‘man ascii’ and you’ll be greeted with the same mapping shown below:

For convenience, below are more compact tables in hex and decimal.

   2 3 4 5 6 7       30 40 50 60 70 80 90 100 110 120
 -------------      ---------------------------------
0:   0 @ P ` p     0:    (  2  <  F  P  Z  d   n   x
1: ! 1 A Q a q     1:    )  3  =  G  Q  [  e   o   y
2: " 2 B R b r     2:    *  4  >  H  R  \  f   p   z
3: # 3 C S c s     3: !  +  5  ?  I  S  ]  g   q   {
4: $ 4 D T d t     4: "  ,  6  @  J  T  ^  h   r   |
5: % 5 E U e u     5: #  -  7  A  K  U  _  i   s   }
6: & 6 F V f v     6: $  .  8  B  L  V  `  j   t   ~
7: ' 7 G W g w     7: %  /  9  C  M  W  a  k   u  DEL
8: ( 8 H X h x     8: &  0  :  D  N  X  b  l   v
9: ) 9 I Y i y     9: '  1  ;  E  O  Y  c  m   w
A: * : J Z j z
B: + ; K [ k {
C: , < L \ l |
D: - = M ] m }
E: . > N ^ n ~
F: / ? O _ o DEL

This shows the hex version on the left and the decimal version on the right. Let’s look at the decimal number 33. If you check the decimal chart on the right, you will see that the cross-section between 30 (top row) and 3 (left column) is the symbol ‘!’ : an exclamation point.

Decimal 33 is the same as 0x21 in hex. Looking at the cross-section between 2 (top) and 1 (left) of the hex table shows that it also maps to the same symbol, the exclamation point.

This is because both decimal 33 and hex 0x21 are different ways of looking at the same binary value: 00100001. This is what is actually being mapped to the exclamation point symbol.
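
If you'd like to confirm this from a Python prompt, the built-ins chr, ord, and format are enough. This is just a quick check, not part of the B64 class.

```python
# Decimal, hex, and binary literals all name the same integer.
assert 33 == 0x21 == 0b00100001

symbol = chr(33)              # the character mapped to that value: '!'
code = ord("!")               # back from the character to the value: 33
bits = format(code, "08b")    # the value as eight binary digits: '00100001'

print(symbol, code, hex(code), bits)
```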

Now, look at the first three values in the 30 column of the decimal chart. All three look blank. However, this is not the case. Decimal 32 (0x20) is the space character. Decimal 30 and 31 are control characters and do not print a symbol. Thus, the printable characters begin at decimal 32 (0x20) and run through decimal 126 (0x7e).

One last thing to take note of is simply that the ASCII decimal table and my ascii_printable list are in the exact same order. I’ll let you ponder why that is so important based on the previous points.
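
If you want to check your answer, here is a short sketch. It rebuilds the same list with a comprehension (my shortcut here, not how the class defines it) and shows that a character's position in the list plus 32 is its ASCII decimal value.

```python
# Rebuild the printable ASCII range: decimal 32 (space) through 126 (~).
ascii_printable = [chr(code) for code in range(32, 127)]

# Because the list follows table order, index + 32 is the character's code.
index = ascii_printable.index("!")
print(index, index + 32, ord("!"))

# The same relationship holds for every entry in the list.
all_match = all(i + 32 == ord(c) for i, c in enumerate(ascii_printable))
print(len(ascii_printable), all_match)
```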

Before I get back to coding, I want to shore up your understanding of the base64 algorithm.

As we just went over, each ASCII character is mapped to a unique sequence of exactly 8 bits. Base64, on the other hand, only requires 6 bits, as 2 to the 6th power equals 64. Thus, with just 6 bits, we have 64 unique sequences that can each be mapped to their own respective symbol.
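
Here is the whole idea traced on the three-letter message "Man", as a rough sketch using built-ins rather than the class methods built later. BASE64_ALPHABET is a name I'm introducing for the 64-symbol table.

```python
BASE64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

message = "Man"

# Each character becomes 8 bits: 3 characters give 24 bits.
bit_string = "".join(format(ord(ch), "08b") for ch in message)

# The same 24 bits, regrouped 6 at a time, give 4 groups.
groups = [bit_string[i:i + 6] for i in range(0, len(bit_string), 6)]

# Each 6-bit value (0 through 63) picks one symbol from the table.
encoded = "".join(BASE64_ALPHABET[int(group, 2)] for group in groups)

print(bit_string)   # 010011010110000101101110
print(groups)       # ['010011', '010110', '000101', '101110']
print(encoded)      # TWFu
```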

Back to coding… Based on the information above, it should be apparent that the B64 class needs a handful of additional methods to enable encoding and decoding.

First, it needs to be able to derive the proper binary pattern for each character it is tasked to encode.

def ascii_to_decimal(self, char):
    for i, c in enumerate(self.ascii_printable):
        if c == char:
            return i + 32

def decimal_to_binary(self, number, bits):
    if bits == -1:
        return b""
    if number >= 2 ** bits:
        return b"1" + self.decimal_to_binary(number - 2 ** bits, bits - 1)
    return b"0" + self.decimal_to_binary(number, bits - 1)

I chose to do this in two steps because decimal_to_binary will need to be used for decoding as well. Thus, this keeps things more modular and reusable.
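
As a sanity check, the sketch below copies decimal_to_binary out of the class as a plain function and compares it with Python's built-in format. The helper to_bits is a hypothetical name used only for the comparison.

```python
def decimal_to_binary(number, bits):
    # Standalone copy of the method: `bits` is the exponent of the
    # leftmost digit, so 7 produces eight digits and 5 produces six.
    if bits == -1:
        return b""
    if number >= 2 ** bits:
        return b"1" + decimal_to_binary(number - 2 ** bits, bits - 1)
    return b"0" + decimal_to_binary(number, bits - 1)


def to_bits(number, width):
    # Hypothetical built-in equivalent: zero-padded binary as bytes.
    return format(number, f"0{width}b").encode("ascii")


eight_digits = decimal_to_binary(ord("!"), 7)   # b"00100001"
six_digits = decimal_to_binary(19, 5)           # b"010011"
print(eight_digits, six_digits)
print(all(decimal_to_binary(n, 7) == to_bits(n, 8) for n in range(256)))
```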

Next, B64 needs to break the binary stream into the proper sized chunks (6 bits in this case) and derive the proper decimal values from the new binary patterns based on each group of 6 bits.

def create_groups_of_binary(self, binary, group_size):
    groups = []
    start = 0
    stop = group_size
    while not stop > len(binary):
        groups.append(binary[start:stop])
        start = stop
        stop += group_size
    
    return groups

def binary_to_decimal(self, bin, bits):
    if bits == -1:
        return 0
    # indexing a bytes object yields an int, and ord("0") == 48, so b"0" gives 0 (falsy) and b"1" gives 1 (truthy)
    if bin[0] - 48:
        return 2 ** bits + self.binary_to_decimal(bin[1:], bits - 1)
    return self.binary_to_decimal(bin[1:], bits - 1)

Again, this is written for re-usability. Passing a group_size parameter of 6 sets up for base64 encoding. But it can also be used with a group_size parameter of 8 to go back to ASCII when used for decoding.
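
To illustrate that reuse, the sketch below runs standalone copies of the two methods over the same 24 bits, once in groups of 6 and once in groups of 8. The loop is written with `<=` instead of `not ... >`, which behaves the same.

```python
def create_groups_of_binary(binary, group_size):
    # Standalone copy of the method: slices off full groups only.
    groups = []
    start, stop = 0, group_size
    while stop <= len(binary):
        groups.append(binary[start:stop])
        start, stop = stop, stop + group_size
    return groups


def binary_to_decimal(binary, bits):
    # Standalone copy: `bits` is the exponent of the leftmost digit.
    if bits == -1:
        return 0
    if binary[0] - 48:  # b"1"[0] is 49, b"0"[0] is 48
        return 2 ** bits + binary_to_decimal(binary[1:], bits - 1)
    return binary_to_decimal(binary[1:], bits - 1)


bit_stream = b"010011010110000101101110"   # the 24 bits of "Man"
six_bit = [binary_to_decimal(g, 5) for g in create_groups_of_binary(bit_stream, 6)]
eight_bit = [binary_to_decimal(g, 7) for g in create_groups_of_binary(bit_stream, 8)]

print(six_bit)     # [19, 22, 5, 46]  -> indexes into the base64 table
print(eight_bit)   # [77, 97, 110]    -> ASCII codes for 'M', 'a', 'n'
```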

Now that the building blocks are all set for encoding, I’ll build out the encode method and take care of padding.

def encode(self):
    binary_message = b""
    for char in self.user_input:
        binary_message += self.decimal_to_binary(self.ascii_to_decimal(char), 7)
        
    padding = 0
    while len(binary_message) % 6:
        binary_message += b"0"
        padding += 1

    encoded = ""
    for group_of_binary in self.create_groups_of_binary(binary_message, 6):
        encoded += self.base64_printable[self.binary_to_decimal(group_of_binary, 5)]

    return encoded + "=" * (padding // 2)

Some of you may be wondering why I set the bits parameter to 7 in my call to decimal_to_binary and 5 in my call to binary_to_decimal as opposed to 8 and 6 respectively.

As with all number systems (base 10 inclusive) we begin counting at 0. If you don’t believe me, try counting by 2’s.

Did you say 2, 4, 6, etc.? Of course you did. Why not 1, 3, 5, etc.? Because you instinctively started your count at zero. You just may not have said it out loud.

Binary, as a number system, has only two digits: 1 and 0. Just like 10 is 10 to the first power and 100 is 10 to the second power in the base 10 system, 10 is 2 to the first power and 100 is 2 to the second power in binary. Thus, if the leftmost digit of an eight digit string of binary is one, it represents 2 to the 7th power. Similarly, the leftmost digit of a 6 bit sequence represents 2 to the fifth power.

Hopefully, you can tell why padding is needed for base64 encoding. As discussed previously, every character used to encode base64 is represented by a 6 digit binary sequence. But the incoming ASCII symbols are represented by 8 bits.

3 x 8 is equal to 6 x 4. And that is great. It means that any message whose length is evenly divisible by 3 can be encoded into base64 with nothing left over. But what happens if the input string length is not divisible by three? If the length mod 3 were one, there would be two bits left over after the last full group of six (8 - 6 = 2), so four zero bits must be appended to complete one more group. If the length mod 3 were two, there would be four bits left over (8 x 2 - 6 x 2 = 16 - 12 = 4), so two zero bits must be appended. Both cases therefore require padding characters so the decoder knows how many of those filler bits to discard. Each = padding character stands for 2 appended zero bits. So the first case requires two padding symbols and the second case requires only one.
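
The arithmetic is easy to tabulate. This sketch works out, for a few message lengths, how many zero bits get appended and how many '=' characters that implies; padding_for is just an illustrative name.

```python
# For each message length in bytes, find how many zero bits are needed
# to reach a multiple of 6, and how many '=' characters that implies.
padding_for = {}
for length in range(1, 7):
    total_bits = length * 8
    zero_bits = (-total_bits) % 6          # bits needed to finish the last group
    padding_for[length] = zero_bits // 2   # one '=' per 2 appended zero bits
    print(length, total_bits, zero_bits, "=" * padding_for[length])
```

The pattern repeats every three bytes: two padding characters, then one, then none.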

Now we can move on to decoding.

Decoding base64 is basically the same as encoding, just in reverse. The B64 object will need to strip the padding off the end and take note of how many bits to discard (2 bits per padding character). Then it needs to convert each base64 character into binary so that it can regroup the binary and get decimal values from those new groups. Match the decimal values to the ASCII list and, voila.
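
Traced by hand on the four-character string "TWE=", those steps look roughly like this. The sketch uses built-ins and a BASE64_ALPHABET string of my own naming, so treat it as an illustration of the steps rather than the class's code.

```python
BASE64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

encoded = "TWE="

# Step 1: count the padding and strip it off the end.
padding = encoded.count("=")
body = encoded.rstrip("=")

# Step 2: turn each base64 character into its 6-bit pattern.
bit_string = "".join(format(BASE64_ALPHABET.index(ch), "06b") for ch in body)

# Step 3: discard 2 filler bits per padding character.
bit_string = bit_string[:len(bit_string) - padding * 2]

# Step 4: regroup into 8-bit chunks and map each back to a character.
decoded = "".join(
    chr(int(bit_string[i:i + 8], 2)) for i in range(0, len(bit_string), 8)
)

print(padding, bit_string, decoded)   # 1 0100110101100001 Ma
```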

def base64_to_decimal(self, char):
    for i, c in enumerate(self.base64_printable):
        if c == char:
            return i

def decode(self):
    padding = sum([1 for char in self.user_input if char == "="])
    self.user_input = self.user_input[:len(self.user_input) - padding]

    binary_message = b""
    for char in self.user_input:
        binary_message += self.decimal_to_binary(self.base64_to_decimal(char), 5)
    binary_message = binary_message[:len(binary_message) - (padding * 2)]

    decoded = ""
    for group_of_binary in self.create_groups_of_binary(binary_message, 8): 
        decoded += self.ascii_printable[self.binary_to_decimal(group_of_binary, 7) - 32]

    return decoded

If you’ve been following along, you should now have everything you need to encode ASCII into base64 and decode base64 back into ASCII. This should work very similarly to the Python base64 library, except that the standard library version requires bytes as input and this does not.
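
For comparison, here is what the round trip looks like with the standard library. The explicit encode and decode calls are the extra step a str needs there.

```python
import base64

text = "Hello, World!"

# b64encode takes bytes and returns bytes, so a str has to be converted
# on the way in and on the way out.
encoded = base64.b64encode(text.encode("ascii")).decode("ascii")
decoded = base64.b64decode(encoded).decode("ascii")

print(encoded)   # SGVsbG8sIFdvcmxkIQ==
print(decoded)   # Hello, World!
```

Running your own B64 class on the same text and comparing results is a handy way to test it.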

Now that it works, let’s extend it.

One of my objectives when writing this class was to be able to encode ASN.1 RSA keys. In order to do this, I needed to be able to encode hex strings using base64. So, the next section will focus on doing just that.

Fortunately, all of the building blocks are already there, which means I just need to make a few modifications to the encode method itself.

def encode(self, input_type="string"):
    binary_message = b""
    if input_type == "string":
        for char in self.user_input:
            binary_message += self.decimal_to_binary(self.ascii_to_decimal(char), 7)

    if input_type == "hexstring":
        for i in range(0, len(self.user_input), 2):
            binary_message += self.decimal_to_binary(int(self.user_input[i: i + 2], 16), 7)
        
    padding = 0
    while len(binary_message) % 6:
        binary_message += b"0"
        padding += 1

    encoded = ""
    for group_of_binary in self.create_groups_of_binary(binary_message, 6):
        encoded += self.base64_printable[self.binary_to_decimal(group_of_binary, 5)]

    return encoded + "=" * (padding // 2)

I added a default input type so that calling the encode method will default to ASCII unless called with ‘hexstring’ as a parameter.

In the case of a hex string, each pair of characters represents one byte written in hexadecimal. Thus, I take two characters at a time, convert them to their proper decimal value, and proceed from there.
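
The pairing step on its own looks like the sketch below, using the made-up sample value "deadbeef". The standard library's bytes.fromhex does the same conversion, which gives a reference result to check against.

```python
import base64

hex_string = "deadbeef"   # sample input, chosen only for illustration

# Take two characters at a time and read each pair as a base 16 number.
byte_values = [
    int(hex_string[i:i + 2], 16) for i in range(0, len(hex_string), 2)
]
print(byte_values)   # [222, 173, 190, 239]

# Reference result from the standard library for the same four bytes.
reference = base64.b64encode(bytes.fromhex(hex_string)).decode("ascii")
print(reference)     # 3q2+7w==
```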

There are two things that need to be changed for decoding:

def decode(self, output_type="string"):
    padding = sum([1 for char in self.user_input if char == "="])
    self.user_input = self.user_input[:len(self.user_input) - padding]

    binary_message = b""
    for char in self.user_input:
        binary_message += self.decimal_to_binary(self.base64_to_decimal(char), 5)
    binary_message = binary_message[:len(binary_message) - (padding * 2)]

    decoded = ""
    if output_type == "string":   
        for group_of_binary in self.create_groups_of_binary(binary_message, 8): 
            decoded += self.ascii_printable[self.binary_to_decimal(group_of_binary, 7) - 32]
    if output_type == "hexstring":
        for group_of_binary in self.create_groups_of_binary(binary_message, 16):
            if len(group_of_binary) == 8:
                decimal = hex(self.binary_to_decimal(group_of_binary, 7))[2:]
                if len(decimal) % 2:
                    decimal = "0" + decimal
            if len(group_of_binary) == 16:
                decimal = hex(self.binary_to_decimal(group_of_binary, 15))[2:]
                while len(decimal) % 4:
                    decimal = "0" + decimal
            decoded += decimal

    return decoded

A default case and hex string logic just like the encode method, and

def create_groups_of_binary(self, binary, group_size):
    groups = []
    start = 0
    stop = group_size
    while not stop > len(binary):
        groups.append(binary[start:stop])
        start = stop
        stop += group_size
    
    if not stop - len(binary) == group_size:
        groups.append(binary[start: len(binary)])

    return groups

Additional logic that accounts for any length mismatches. This can happen in the hex string branch of decode, which regroups the binary 16 bits at a time, that is, two bytes or four hex digits per group. If the decoded data holds an odd number of bytes, the last group has only 8 bits and does not fill the expected 16. Without the extra check, that final byte would be silently dropped.
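
Here is a small sketch of that situation with three bytes of made-up data. The helper create_groups is a hypothetical, slice-based stand-in for the updated method, and it works on a str of bits rather than bytes to keep things short.

```python
def create_groups(bits, group_size):
    # Hypothetical stand-in: slicing keeps a short final group
    # instead of dropping it.
    return [bits[i:i + group_size] for i in range(0, len(bits), group_size)]


raw = bytes.fromhex("00ad0f")                          # three bytes, an odd count
bits = "".join(format(byte, "08b") for byte in raw)    # 24 bits as a str
groups = create_groups(bits, 16)                       # one 16-bit group, one 8-bit group

# Every 4 bits is one hex digit, so pad each group to len(group) // 4 digits.
hex_string = "".join(format(int(g, 2), f"0{len(g) // 4}x") for g in groups)

print([len(g) for g in groups])   # [16, 8]
print(hex_string)                 # 00ad0f
```

The zero padding matters here: without it, the leading 00 byte and the leading 0 of 0f would disappear from the output.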

And there you have it. This should now be able to take either hex string or ASCII characters as an input and encode them into base64. Or take base64 and decode it to ASCII or hex string.

Here’s some CLI code to get started playing around with it.

from argparse import ArgumentParser

if __name__ == "__main__":
    parser = ArgumentParser( prog="base64 (de|en)coder", description="A simple python implementation of encoding and decoding base64 strings")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-e", "--encode", help="Use this flag to encode a string into base64", action="store_true")
    group.add_argument("-d", "--decode", help="Use this flag to decode a base64 string into plaintext", action="store_true")
    parser.add_argument("user_input", metavar="input", nargs="+", help="The string you would like either encoded or decoded")
    parser.add_argument("--type", choices=["string", "hexstring"], help="Type of input/output - supports string and hexstring: Default is 'string'")
    args = parser.parse_args()

    user_input = " ".join(args.user_input)
    _type = "hexstring" if args.type == "hexstring" else "string"

    b64 = B64(user_input)
    if args.encode:
        print(f"\n'{user_input}' encoded into base64 is:\n{b64.encode(_type)}")
    else:
        print(f"\nbase64 string '{user_input}' decoded is:\n{b64.decode(_type)}")

My next post will be about ASN.1 objects. I’ll probably touch on PEM and you’ll get to see why base64 encoding is important to RSA. See you there!

Resources:

https://base64.guru/learn/base64-algorithm/encode
