C++ and binary

    • Thread Starter
    • 6 followers
    When you open a file with C++ using something like this:

    Code:
    ifstream in ("sample.txt");
    
    string a;
    
    in >> a;
    then if sample.txt contained 'abc' so would string a.

    Now as all files are essentially a collection of 0s and 1s (right?) some sort of conversion is happening between the file (a collection of 0s and 1s) and how it is stored as a string 'abc'.

    My question is thus, how can I access the raw binary code and store it in a string? Is this even possible/make sense?
    • 3 followers
    Why do you want to do this? There are lots of alternatives, and knowing the goal would help choose among them. Without more information, I'd use a std::stringstream: write an appropriate stream manipulator (std::dec, std::hex or std::oct) followed by each character as an unsigned int. (There is no standard manipulator for base 2, so this gives decimal, hex or octal digits rather than literal 0s and 1s.) The easiest way of getting the character as an unsigned int is probably to iterate through the string and cast each character. This has the benefit of not introducing a dependency on the I/O in what's essentially a function from strings to strings. Finally, you can use std::stringstream::str to get the converted string.
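    A minimal sketch of the stringstream approach described above, assuming hexadecimal output is wanted. The name to_hex_string is made up for illustration, and std::ostringstream is used in place of std::stringstream since only output is needed. The cast goes through unsigned char first so that bytes above 127 are not sign-extended on platforms where plain char is signed.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: render each byte of `input` as two lowercase hex digits.
std::string to_hex_string(const std::string& input) {
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (char c : input) {
        // Convert via unsigned char so values above 127 do not sign-extend
        // when widened to unsigned int.
        unsigned int value = static_cast<unsigned char>(c);
        // std::setw applies to the next output only, so set it every time.
        out << std::setw(2) << value;
    }
    return out.str();
}
```

    With an ASCII-compatible character set, to_hex_string("abc") gives "616263".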
    • Thread Starter
    • 6 followers
    (Original post by TheUnbeliever)
    Why do you want to do this? There's lots of alternatives, and it would help choose amongst them. Without more information, I'd use a std::stringstream - write an appropriate stream manipulator (std::dec, std::hex or std::oct) followed by each character as an unsigned int. Easiest way of getting the character as an unsigned int is probably to iterate through the string and cast each character. This has the benefit of not introducing a dependency on the I/O in what's essentially a function from strings to strings. Finally, you can use std::stringstream::str to get the converted string.
    Ah, thanks. I was going to try to make a simple compression algorithm (0000111 = (4*0) + (3*1)). Not sure now!
    • 3 followers
    Ah, then you probably don't want to do what I described - open the file in binary mode: std::ifstream ifs(path, std::ios::binary) and use std::ifstream::read to get bytes. After that, I would suggest using bit masking instead of converting to a string of '0' and '1' characters. If you're looking for information online, this compression scheme is called run-length encoding. It's a good task! :-)
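    A rough sketch of the bit-masking idea applied to run-length encoding, assuming 8-bit bytes and that bits are visited from the most significant end of each byte. The names BitRun and run_length_encode_bits are invented for this example, and the input is taken to be bytes already read from the file in binary mode.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One run: the bit value (0 or 1) and how many times it repeats.
using BitRun = std::pair<int, std::size_t>;

// Hypothetical helper: walk the bits of each byte from most significant to
// least significant and collapse consecutive equal bits into runs.
// Assumes 8-bit bytes.
std::vector<BitRun> run_length_encode_bits(const std::string& bytes) {
    std::vector<BitRun> runs;
    for (char c : bytes) {
        const unsigned char byte = static_cast<unsigned char>(c);
        for (int shift = 7; shift >= 0; --shift) {
            const int bit = (byte >> shift) & 1;  // mask out a single bit
            if (!runs.empty() && runs.back().first == bit) {
                ++runs.back().second;             // extend the current run
            } else {
                runs.push_back(BitRun(bit, 1));   // start a new run
            }
        }
    }
    return runs;
}
```

    For the single byte 00001111 this produces two runs: four 0s followed by four 1s. Runs are allowed to continue across byte boundaries, which is one design choice among several.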
    • 12 followers
    (Original post by beepbeeprichie)
    When you open a file with C++ using something like this:

    Code:
    ifstream in ("sample.txt");
    
    string a;
    
    in >> a;
    then if sample.txt contained 'abc' so would string a.

    Now as all files are essentially a collection of 0s and 1s (right?) some sort of conversion is happening between the file (a collection of 0s and 1s) and how it is stored as a string 'abc'.

    My question is thus, how can I access the raw binary code and store it in a string? Is this even possible/make sense?
    Read this: http://en.wikipedia.org/wiki/Ascii

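    That explains one of the most common ways to interpret bytes as characters. Other text encoding schemes exist. The C++ standard doesn't actually mandate ASCII (the character set is implementation-defined), but practically every mainstream platform uses ASCII or a superset of it. Worth noting that ASCII maps a 7-bit number to a character, while the C++ 'char' type is one byte, which is at least 8 bits and almost always exactly 8. The top bit isn't ignored: byte values 128 to 255 simply fall outside ASCII, and what they mean depends on the encoding in use.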

    An important thing to understand is that the character '1' is not really equivalent to a binary 1 (according to the ASCII table its value is actually 00110001). A binary 1 or 0 is 1 bit (by definition), whereas the character '1' or the character '0' is actually 8 bits. It's not that simple to convert some raw binary data into a string of '1' and '0' characters.
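    To make the distinction concrete, here is a small sketch that prints the bit pattern behind a character, assuming an ASCII-compatible character set and 8-bit char. The helper name bit_pattern is hypothetical; std::bitset does the actual conversion.

```cpp
#include <bitset>
#include <string>

// Hypothetical helper: the 8-bit pattern of a single character's code.
// Assumes an ASCII-compatible character set and 8-bit char.
std::string bit_pattern(char c) {
    return std::bitset<8>(static_cast<unsigned char>(c)).to_string();
}
```

    Under those assumptions bit_pattern('1') is "00110001", while the byte with numeric value 1, bit_pattern('\x01'), is "00000001".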

    First thing you want to do is change

    ifstream in ("sample.txt");

    to

    ifstream in ( "sample.txt" , ifstream::in | ifstream::binary );

    This will treat the file as binary, not text. This used to confuse me, because it's all really just binary, right? Well, the difference is how it treats special control characters. In ASCII there are special control characters to do things like start a new line. In text mode the library may convert some of these (on Windows, for example, a carriage return plus line feed pair in the file is typically read as a single newline character), but binary mode guarantees the data you read is exactly what's in the file.
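    A minimal sketch of reading a whole file in binary mode into a std::string. The function name read_file_bytes is an assumption for illustration, as is the choice to return an empty string when the file cannot be opened.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: read every byte of the file at `path`, unmodified.
// Returns an empty string if the file cannot be opened.
std::string read_file_bytes(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    if (!in) {
        return std::string();
    }
    // istreambuf_iterator reads raw characters and does not skip whitespace,
    // unlike operator>> which stops at the first space or newline.
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

    Because std::string stores its length explicitly, embedded zero bytes in the file survive the round trip.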

    Here's one way to convert to a string representing the binary. The basic concept is to treat each character in the original string as an unsigned number and keep dividing it by 2. If there is a remainder, you add a '1' character to the output string. If there is no remainder, you add a '0' to the output string. Once you've done that 8 times (because a 'char' is 8 bits), the string you end up with will be the binary number backwards, so reverse it to get the correct string.
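    The steps above can be sketched roughly as follows, assuming an 8-bit char. The function name to_bit_string is invented for this example.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper implementing the repeated-division idea above.
// Assumes 8-bit char.
std::string to_bit_string(const std::string& input) {
    std::string result;
    for (char c : input) {
        unsigned int value = static_cast<unsigned char>(c);
        std::string bits;
        for (int i = 0; i < 8; ++i) {
            // The remainder after dividing by 2 is the lowest remaining bit.
            bits += (value % 2 != 0) ? '1' : '0';
            value /= 2;
        }
        // Digits were produced lowest bit first, so flip them.
        std::reverse(bits.begin(), bits.end());
        result += bits;
    }
    return result;
}
```

    With an ASCII-compatible character set, to_bit_string("1") gives "00110001", matching the table value mentioned earlier.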
    • 1 follower
    There are two types of I/O: character-based, which uses an encoding as mentioned above (ASCII etc.), and binary. Using binary streams allows you to grab the actual binary data from the file.

    For POSIX-compliant OSes (Windows has partial support) you can use the functions fread and fwrite to grab binary data; otherwise you'll have to look into your system's API for your OS.

    For more information look into the man pages.
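    For reference, fread and fwrite belong to the C standard library, so from C++ they are also reachable through <cstdio>. A minimal sketch follows; the helper name read_with_fread and the 4096-byte buffer size are arbitrary choices for illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: read a whole file with the C stdio functions.
// "rb" requests binary mode. Returns an empty vector if the open fails.
std::vector<unsigned char> read_with_fread(const char* path) {
    std::vector<unsigned char> bytes;
    std::FILE* file = std::fopen(path, "rb");
    if (file == nullptr) {
        return bytes;
    }
    unsigned char buffer[4096];
    std::size_t count = 0;
    // fread returns the number of elements read; 0 means end of file or error.
    while ((count = std::fread(buffer, 1, sizeof buffer, file)) > 0) {
        bytes.insert(bytes.end(), buffer, buffer + count);
    }
    std::fclose(file);
    return bytes;
}
```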
    • 0 followers
    On most Windows implementations, IIRC, reading in binary and text modes is effectively equivalent (i.e. Windows does not actually have a "text mode" internally; all reads are binary).

    As for your original question, a string is fundamentally just an array of characters*.
    A char is, fundamentally speaking, just a byte**. Reading a file into a string is nearly equivalent to just reading the file bytes into an array. (The differences include treatment of nulls and line endings. And more generally charset issues.)
    Hence you're perfectly entitled to just address each character and interpret it as you see fit.

    *I'm ignoring MBCS here.
    **I know it's not necessarily a byte, but most coders seem to live in a 7-bit character world :P
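    As a sketch of "interpret it as you see fit", the following treats each char of a string as a raw byte value. The name byte_values is hypothetical. The conversion through unsigned char matters because plain char may be signed, in which case a byte such as 0xE9 would otherwise come out negative.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: view each char of a string as an unsigned byte value.
std::vector<unsigned int> byte_values(const std::string& text) {
    std::vector<unsigned int> values;
    values.reserve(text.size());
    for (char c : text) {
        // Plain char may be signed, so convert to unsigned char first to
        // get a value in the range 0..UCHAR_MAX.
        values.push_back(static_cast<unsigned char>(c));
    }
    return values;
}
```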
    • 3 followers
    (Original post by JGR)
    **I know it's not necessarily a byte, but most coders seem to live in a 7-bit character world :P
    If we're playing the pedantry game, a char is a byte but not necessarily an octet.
    • 0 followers
    (Original post by TheUnbeliever)
    If we're playing the pedantry game, a char is a byte but not necessarily an octet.
    You could argue that a char doesn't have to be a byte regardless of whether a byte is an octet. :P

    There are other nasty bits like signedness which also cause hassle.
    • 3 followers
    (Original post by JGR)
    You could argue that a char doesn't have to be a byte regardless of whether a byte is an octet. :P
    Not sure what you mean. The number of bits in a byte is defined to be CHAR_BIT. Whether char is signed or not has little to do with whether it is a byte. However, this is wildly off-topic so I'll drop it.
    • 0 followers
    (Original post by JGR)
    You could argue that a char doesn't have to be a byte regardless of whether a byte is an octet. :P

    There are other nasty bits like signedness which also cause hassle.
    A byte is usually defined as the smallest addressable unit of memory. I think he was meaning that a char was always a byte even on systems where CHAR_BIT > 8.
    • 0 followers
    (Original post by TheUnbeliever)
    Not sure what you mean. The number of bits in a byte is defined to be CHAR_BIT. Whether char is signed or not has little to do with whether it is a byte. However, this is wildly off-topic so I'll drop it.
    For all systems that I know of, yes. I was just being pedantic.
    The signedness issue is just another gotcha worth mentioning.


    My overall point is that assuming that a character (in the text sense of the word) is always equivalent to a char (in the C++ type sense of the word) can lead to some unpleasant pitfalls if care is not taken.


    But at any rate, I don't expect that the OP will be dealing with anything other than plain ASCII or CP1252.
    • 1 follower
    Binary Binary Binary Binary Binary Binary Binary Binary Binary

    Sorry couldn't help it I love Doctor Who too much
    • 3 followers
    (Original post by JGR)
    For all systems that I know of, yes.
    Sorry to come back to this, but this isn't what I meant or said - maturestudy has me right.

    The standard literally says:

    "The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (2.3) and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation defined"

    and later defines CHAR_BIT to take the definition given in the C standard, which is:

    "number of bits for smallest object that is not a bit-field (byte)"

    And just to flog a dead donkey:

    "Values stored in non-bit-field objects of any other object type consist of n × CHAR_BIT bits, where n is the size of an object of that type, in bytes."

    A byte contains CHAR_BIT bits.
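    The relationships quoted above can be checked at compile time. This is only a sketch, assuming a C++11 compiler; the helper name bit_width_of is made up for illustration, while CHAR_BIT itself comes from <climits>.

```cpp
#include <climits>
#include <cstddef>

// sizeof(char) is 1 by definition, so a char occupies exactly one byte
// whatever the number of bits in that byte.
static_assert(sizeof(char) == 1, "a char is always exactly one byte");

// The standard requires a byte to have at least 8 bits.
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

// Hypothetical helper: number of bits occupied by an object of type T,
// i.e. n * CHAR_BIT where n is the size of T in bytes.
template <typename T>
constexpr std::size_t bit_width_of() {
    return sizeof(T) * CHAR_BIT;
}
```

    On a typical desktop platform bit_width_of<char>() is 8, but the code makes no such assumption.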


    Regardless of whether binary or text mode is identical under certain sets of circumstances, I don't see any reason not to use binary mode. Most importantly, it ensures you get the right behaviour from a standard-compliant implementation. Secondly, it documents your expectation of that behaviour. At the cost of about a dozen extra characters, I don't see where this discussion is coming from.
    • 1 follower
    (Original post by maturestudy)
    A byte is usually defined as the smallest addressable unit of memory. I think he was meaning that a char was always a byte even on systems where CHAR_BIT > 8.
    This is wrong. A system that uses (32-bit) word-addressing for memory would then make your byte 32-bits? While it may be true that most user computers today are byte-addressed and that byte happens to be 8-bits long, it is not safe to think of a byte in this way.

    A byte is defined (in the C99 specification) as the smallest number of bits required to hold a character. ASCII is the standard encoding, so that is 7 bits (for the character) + 1 parity bit = 8 bits. The previous posters are right: a char is ALWAYS a byte, but the byte doesn't have to be 8 bits, and hence the char isn't necessarily 8 bits either.

    Along with the size of bytes, you should also consider the endianness of your architecture when reading binary. Portable code shouldn't assume little-endianness or 8-bit bytes.
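    One common way to avoid depending on the host's endianness is to assemble multi-byte values with shifts instead of copying memory. The sketch below assumes the file format stores a 32-bit integer least significant byte first and that each array element holds one octet; the name read_u32_le is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: assemble a 32-bit value from four bytes stored
// least-significant byte first (a little-endian file format is assumed).
// Shifts operate on values rather than memory layout, so the result is the
// same on little-endian and big-endian hosts.
std::uint32_t read_u32_le(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

    Given the bytes 0x78 0x56 0x34 0x12 in that order, the function returns 0x12345678.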
    • 0 followers
    (Original post by Oxy)
    This is wrong. A system that uses (32-bit) word-addressing for memory would then make your byte 32-bits? While it may be true that most user computers today are byte-addressed and that byte happens to be 8-bits long, it is not safe to think of a byte in this way.
    http://www.thefreedictionary.com/usually

    I have written code on DSPs that have 10-bit bytes and some do, indeed, have 32-bit bytes.
    • 1 follower
    (Original post by maturestudy)
    http://www.thefreedictionary.com/usually

    I have written code on DSPs that have 10-bit bytes and some do, indeed, have 32-bit bytes.
    Yeah, congrats mate, but DSPs are completely different to general-purpose CPUs in almost every way. Also, my point wasn't that you can't get 32-bit bytes; it was simply that reasoning about the size of a byte from the memory addressing is wrong.
    • 12 followers
    If I ever have to use a system which doesn't have an 8-bit byte, I think my brain will collapse in on itself.

    Different word lengths I can manage (well, I'll probably have to soon), but it would be hard to think of a byte as anything but 8 bits. Luckily, I doubt any of the platforms I'm likely to work on will redefine that.
    • 0 followers
    (Original post by TheUnbeliever)
    Sorry to come back to this, but this isn't what I meant or said - maturestudy has me right.

    Regardless of whether binary or text mode is identical under certain sets of circumstances, I don't see any reason not to use binary mode. Most importantly, it ensures you get the right behaviour from a standard-compliant implementation. Secondly, it documents your expectation of that behaviour. At the cost of about a dozen extra characters, I don't see where this discussion is coming from.
    That's fair enough, I hadn't rechecked the spec.

    The point about binary/text mode was a reference to Oxy's comment, which implied that it may not work on non-POSIX systems (i.e. MS Windows). My point was that it would work, and that in any case the mode wouldn't (broadly speaking) make a difference there.

Updated: May 13, 2012