C++ and binary
When you open a file with C++ using something like this:
Code:
ifstream in ("sample.txt");
string a;
in >> a;
then if sample.txt contained 'abc' so would string a.
Now as all files are essentially a collection of 0s and 1s (right?) some sort of conversion is happening between the file (a collection of 0s and 1s) and how it is stored as a string 'abc'.
My question is thus, how can I access the raw binary code and store it in a string? Is this even possible/make sense? -
Re: C++ and binary
Why do you want to do this? There are lots of alternatives, and knowing your goal would help choose among them. Without more information, I'd use a std::stringstream: write an appropriate stream manipulator (std::dec, std::hex or std::oct) followed by each character as an unsigned int. The easiest way of getting the character as an unsigned int is probably to iterate through the string and cast each character. This has the benefit of not introducing a dependency on the I/O in what's essentially a function from strings to strings. Finally, you can use std::stringstream::str to get the converted string.
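A minimal sketch of the std::stringstream approach described above, assuming hex output is wanted. The helper name to_hex_codes and the space separator are illustrative choices, not something from the thread. Note that the standard manipulators only cover decimal, hex and octal, so this does not produce a string of 0s and 1s.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: renders every character of `input` as a hex number,
// with single spaces between the numbers.
std::string to_hex_codes(const std::string& input) {
    std::stringstream out;
    out << std::hex;  // the manipulator stays in effect for later insertions
    for (std::size_t i = 0; i < input.size(); ++i) {
        if (i != 0) {
            out << ' ';
        }
        // Go through unsigned char first so that a negative char value
        // does not turn into a huge unsigned int.
        out << static_cast<unsigned int>(static_cast<unsigned char>(input[i]));
    }
    return out.str();
}
```

The function only maps a string to a string, so it can be used on data from a file, the network or anywhere else.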
-
Re: C++ and binary
(Original post by TheUnbeliever)
Why do you want to do this? There's lots of alternatives, and it would help choose amongst them. Without more information, I'd use a std::stringstream - write an appropriate stream manipulator (std::dec, std::hex or std::oct) followed by each character as an unsigned int. Easiest way of getting the character as an unsigned int is probably to iterate through the string and cast each character. This has the benefit of not introducing a dependency on the I/O in what's essentially a function from strings to strings. Finally, you can use std::stringstream::str to get the converted string.

Ah thanks. I was going to try and make a simple compression algorithm (0000111 = (4*0) + (3*1)). Not sure now!
-
Re: C++ and binary
Ah, then you probably don't want to do what I described - open the file in binary mode: std::ifstream ifs(path, std::ios::binary) and use std::ifstream::read to get bytes. After that, I would suggest using bit masking instead of converting to a string of '0' and '1' characters. If you're looking for information online, this compression scheme is called run-length encoding. It's a good task! :-)
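A rough sketch of that suggestion, assuming 8-bit bytes and a scan from the most significant bit of each byte downwards. The names read_bytes and run_lengths are made up for illustration, and the (bit, length) pair representation is just one possible way to store the runs.

```cpp
#include <climits>
#include <cstddef>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

// Hypothetical helper: reads the whole file as raw bytes using binary mode
// and std::ifstream::read. Returns an empty vector if the file cannot be opened.
std::vector<unsigned char> read_bytes(const std::string& path) {
    std::ifstream ifs(path, std::ios::binary);
    std::vector<unsigned char> bytes;
    char buffer[4096];
    // read() fails on the final partial chunk, so also check gcount().
    while (ifs.read(buffer, sizeof buffer) || ifs.gcount() > 0) {
        bytes.insert(bytes.end(), buffer, buffer + ifs.gcount());
    }
    return bytes;
}

// Hypothetical helper: run-length encodes the bits of `bytes`.
// Each entry is (bit value, number of consecutive bits with that value).
std::vector<std::pair<int, std::size_t>> run_lengths(
        const std::vector<unsigned char>& bytes) {
    std::vector<std::pair<int, std::size_t>> runs;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        for (int shift = 7; shift >= 0; --shift) {
            // Bit masking: shift the wanted bit to position 0, then mask it.
            const int bit = (bytes[i] >> shift) & 1;
            if (!runs.empty() && runs.back().first == bit) {
                ++runs.back().second;  // extend the current run
            } else {
                runs.push_back(std::make_pair(bit, std::size_t(1)));
            }
        }
    }
    return runs;
}
```

For the single byte 00001111 this gives two runs, (0, 4) and (1, 4), which is the same idea as the (4*0) + (3*1) notation in the question. No string of '0' and '1' characters is ever built.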
-
Re: C++ and binary
(Original post by beepbeeprichie)
When you open a file with c++ using something like this:
Code:
ifstream in ("sample.txt");
string a;
in >> a;
then if sample.txt contained 'abc' so would string a.
Now as all files are essentially a collection of 0s and 1s (right?) some sort of conversion is happening between the file (a collection of 0s and 1s) and how it is stored as a string 'abc'.
My question is thus, how can I access the raw binary code and store it in a string? Is this even possible/make sense?

Read this: http://en.wikipedia.org/wiki/Ascii
That explains one of the most common ways to interpret bytes as characters. Other text encoding schemes exist, but ASCII (or a superset of it) is what you'll meet on most C++ platforms, although I don't believe the standard actually requires it (willing to be corrected here). Worth noting that ASCII maps a 7-bit number to a character, but the C++ 'char' type is one byte, which is 8 bits on most machines. For plain ASCII text the top bit is simply 0.
An important thing to understand is that the character '1' is not equivalent to a binary 1 (according to the ASCII table its value is actually 00110001). A binary 1 or 0 is 1 bit (by definition), whereas the character '1' or the character '0' takes up 8 bits. So converting raw binary data into a string of '1' and '0' characters takes a little work.
First thing you want to do is change
ifstream in ("sample.txt");
to
ifstream in ( "sample.txt" , ifstream::in | ifstream::binary );
This will treat the file as binary, not text. This used to confuse me, because it's all really just binary, right? Well, the difference is how it treats special control characters. In ASCII there are special control characters to do things like new line. In text mode the library may do some conversions on these (for example, translating line endings on some platforms; I don't know all the details off the top of my head), but binary mode guarantees the data you read is exactly what's in the file.
Here's one way to convert to a string representing the binary. The basic concept is to treat each character in the original string as a number and keep dividing it by 2. If there is a remainder, you add a '1' character to the output string. If there is no remainder, you add a '0' to the output string. Do that 8 times per character (because a 'char' is 8 bits). The string you end up with for each character will be the binary number backwards, so reverse it to get the correct string. -
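The divide-by-two method from the post above could look something like this. It is only a sketch, assuming 8-bit chars as the post does; the function name to_bit_string is invented for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: turns every char of `data` into eight '0'/'1'
// characters, most significant bit first.
std::string to_bit_string(const std::string& data) {
    std::string result;
    for (std::size_t i = 0; i < data.size(); ++i) {
        // Treat the character as a number in the range 0..255.
        unsigned int value = static_cast<unsigned char>(data[i]);
        std::string bits;
        for (int step = 0; step < 8; ++step) {
            bits += (value % 2 != 0) ? '1' : '0';  // remainder gives the lowest bit
            value /= 2;
        }
        // The bits were collected lowest first, so flip them round.
        std::reverse(bits.begin(), bits.end());
        result += bits;
    }
    return result;
}
```

On an ASCII system, passing the one-character string "1" gives "00110001", matching the table value quoted earlier in the thread.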
Re: C++ and binary
There are two types of I/O: character-based, which uses an encoding as mentioned above (ASCII etc.), and binary-based. Using binary streams allows you to grab the actual binary data from the file.
For POSIX-compliant OSs (Windows has partial support) you can use the functions fread and fwrite to grab binary data; otherwise you'll have to look into the API for your OS.
For more information, look at the man pages. -
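A small sketch of the fread route mentioned above, using the C library through &lt;cstdio&gt;. The function name read_with_fread and the 4096-byte buffer are arbitrary choices for illustration. The "rb" mode string is what asks for binary mode.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: reads a whole file as raw bytes with fopen/fread.
// Returns an empty vector if the file cannot be opened.
std::vector<unsigned char> read_with_fread(const std::string& path) {
    std::vector<unsigned char> bytes;
    std::FILE* file = std::fopen(path.c_str(), "rb");  // "b" = binary mode
    if (file == nullptr) {
        return bytes;
    }
    unsigned char buffer[4096];
    std::size_t count = 0;
    // fread returns how many items (here: bytes) it actually read.
    while ((count = std::fread(buffer, 1, sizeof buffer, file)) > 0) {
        bytes.insert(bytes.end(), buffer, buffer + count);
    }
    std::fclose(file);
    return bytes;
}
```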
Re: C++ and binary
On most Windows implementations, IIRC, reading in binary and text modes is effectively equivalent (i.e. Windows does not actually have a "text mode" internally; all reads are binary).
As for your original question, a string is fundamentally just an array of characters*.
A char is, fundamentally speaking, just a byte**. Reading a file into a string is nearly equivalent to just reading the file bytes into an array. (The differences include treatment of nulls and line endings. And more generally charset issues.)
Hence you're perfectly entitled to just address each character and interpret it as you see fit.
*I'm ignoring MBCS here.
**I know it's not necessarily a byte, but most coders seem to live in a 7-bit character world :P -
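To illustrate the point in the post above that a string is just an array of chars, here is a sketch that reads a file's bytes straight into a std::string. The name read_file_as_string is hypothetical. Binary mode is used so that line endings come through untouched, and a std::string is happy to hold embedded nulls.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns the exact bytes of the file as a std::string.
// Returns an empty string if the file cannot be opened.
std::string read_file_as_string(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    // istreambuf_iterator reads unformatted characters, so whitespace
    // and null bytes are kept rather than skipped.
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Each element of the result can then be indexed and interpreted however the program needs.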
Re: C++ and binary
(Original post by JGR)
**I know it's not necessarily a byte, but most coders seem to live in a 7-bit character world :P

If we're playing the pedantry game, a char is a byte but not necessarily an octet.
-
Re: C++ and binary
(Original post by TheUnbeliever)
If we're playing the pedantry game, a char is a byte but not necessarily an octet.

You could argue that a char doesn't have to be a byte regardless of whether a byte is an octet. :P
There are other nasty bits like signedness which also cause hassle.
-
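A short sketch of the signedness hassle mentioned in the post above, assuming 8-bit chars. Whether plain char is signed is up to the implementation, so a byte such as 0xE9 may read back as a negative number unless it goes through unsigned char first. The helper names are made up.

```cpp
#include <climits>

static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit chars");

// Hypothetical helper: the value of a char as a number in 0..255,
// regardless of whether plain char is signed on this platform.
unsigned int byte_value(char c) {
    return static_cast<unsigned char>(c);
}

// Hypothetical helper: true if the top bit is set, i.e. the byte is
// outside the 7-bit ASCII range.
bool is_high_bit_set(char c) {
    return (byte_value(c) & 0x80u) != 0;
}
```

Casting the char directly to int or unsigned int instead can sign-extend on platforms where char is signed, which is the usual source of the bug.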
Re: C++ and binary
(Original post by JGR)
You could argue that a char doesn't have to be a byte regardless of whether a byte is an octet. :P

Not sure what you mean. The number of bits in a byte is defined to be CHAR_BIT. Whether char is signed or not has little to do with whether it is a byte. However, this is wildly off-topic so I'll drop it.
-
Re: C++ and binary
(Original post by JGR)
You could argue that a char doesn't have to be a byte regardless of whether a byte is an octet. :P
There are other nasty bits like signedness which also cause hassle.

A byte is usually defined as the smallest addressable unit of memory. I think he was meaning that a char was always a byte even on systems where CHAR_BIT > 8.
-
Re: C++ and binary
(Original post by TheUnbeliever)
Not sure what you mean. The number of bits in a byte is defined to be CHAR_BIT. Whether char is signed or not has little to do with whether it is a byte. However, this is wildly off-topic so I'll drop it.

For all systems that I know of, yes. I was just being pedantic.
The signedness issue is just another gotcha worth mentioning.
My overall point is that assuming that a character (in the text sense of the word) is always equivalent to a char (in the C++ type sense of the word) can lead to some unpleasant pitfalls if care is not taken.
But at any rate, I don't expect that the OP will be dealing with anything other than plain ASCII or CP1252.
-
Re: C++ and binary
(Original post by JGR)
For all systems that I know of, yes.

Sorry to come back to this, but this isn't what I meant or said - maturestudy has me right.
Spoiler:
The standard literally says
"The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (2.3) and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation defined"
and later, defines CHAR_BIT to take the definition given in the C standard, which is that
"number of bits for smallest object that is not a bit-field (byte)"
And just to flog a dead donkey:
"A byte contains CHAR_BIT bits"
"Values stored in non-bit-field objects of any other object type consist of n × CHAR_BIT bits, where n is the size of an object of that type, in bytes."
Regardless of whether binary or text mode is identical under certain sets of circumstances, I don't see any reason not to use binary mode. Most importantly, it ensures you get the right behaviour from a standard-compliant implementation. Secondly, it documents your expectation of that behaviour. At the cost of about a dozen extra characters, I don't see where this discussion is coming from.
-
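The relationships quoted from the standard above can be checked at compile time. A minimal sketch, assuming a C++11 compiler; bits_in is a hypothetical helper added for illustration.

```cpp
#include <climits>
#include <cstddef>

// sizeof counts in bytes, and a char is exactly one byte by definition.
static_assert(sizeof(char) == 1, "a char is always exactly one byte");

// A byte is CHAR_BIT bits wide, which is at least 8 but may be more.
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

// Hypothetical helper: number of bits occupied by an object of type T,
// i.e. n x CHAR_BIT where n is the size of the type in bytes.
template <typename T>
constexpr std::size_t bits_in() {
    return sizeof(T) * CHAR_BIT;
}
```

On a typical desktop platform bits_in&lt;char&gt;() is 8, but code that only relies on CHAR_BIT keeps working where it is not.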
Re: C++ and binary
(Original post by maturestudy)
A byte is usually defined as the smallest addressable unit of memory. I think he was meaning that a char was always a byte even on systems where CHAR_BIT > 8.

This is wrong. A system that uses (32-bit) word addressing for memory would then make your byte 32 bits? While it may be true that most user computers today are byte-addressed and that byte happens to be 8 bits long, it is not safe to think of a byte in this way.
A byte is defined (in the C99 specification) as the smallest number of bits required to hold a character. ASCII is the standard encoding, so that is 7 bits (for the character) + 1 parity bit = 8 bits. The previous posters are right: a char is ALWAYS a byte, but the byte doesn't have to be 8 bits and hence the char isn't necessarily 8 bits.
Along with the sizes of bytes, you should also be considering the endianness of your architecture when reading binary. Portable code shouldn't assume little-endianness or 8-bit bytes.
-
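One way to avoid depending on the machine's endianness, as the post above recommends, is to assemble multi-byte values from individual bytes with shifts instead of copying memory. A sketch, assuming the file stores 32-bit values as four octets and that std::uint32_t exists on the platform; the function names are invented.

```cpp
#include <cstdint>

// Hypothetical helper: decodes four octets stored least significant first
// (little-endian file format), whatever the byte order of the host CPU.
std::uint32_t read_u32_le(const unsigned char* p) {
    return static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Hypothetical helper: the same for octets stored most significant first
// (big-endian file format).
std::uint32_t read_u32_be(const unsigned char* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24)
         | (static_cast<std::uint32_t>(p[1]) << 16)
         | (static_cast<std::uint32_t>(p[2]) << 8)
         | static_cast<std::uint32_t>(p[3]);
}
```

Because the shifts work on values rather than on memory layout, the result is the same on little-endian and big-endian hosts.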
Re: C++ and binary
(Original post by Oxy)
This is wrong. A system that uses (32-bit) word-addressing for memory would then make your byte 32-bits? While it may be true that most user computers today are byte-addressed and that byte happens to be 8-bits long, it is not safe to think of a byte in this way.

http://www.thefreedictionary.com/usually
I have written code on DSPs that have 10-bit bytes and some do, indeed, have 32-bit bytes.
Last edited by maturestudy; 12-05-2012 at 15:42.
-
Re: C++ and binary
(Original post by maturestudy)
http://www.thefreedictionary.com/usually
I have written code on DSPs that have 10-bit bytes and some do, indeed, have 32-bit bytes.

Yeah congrats mate, but DSPs are completely different to general-purpose CPUs in almost every way. Also, my point wasn't that you can't get 32-bit bytes; it was simply that your way of reasoning about the size of a byte from the memory addressing is wrong.
-
Re: C++ and binary
If I ever have to use a system which doesn't have an 8-bit byte, I think my brain will collapse in on itself.
Different word lengths I can manage with (well, I'll probably have to soon), but it would be hard to think of a byte as anything but 8 bits. Luckily, I doubt any of the platforms I'm likely to work on will redefine that. -
Re: C++ and binary
(Original post by TheUnbeliever)
Sorry to come back to this, but this isn't what I meant or said - maturestudy has me right.
Regardless of whether binary or text mode is identical under certain sets of circumstances, I don't see any reason not to use binary mode. Most importantly, it ensures you get the right behaviour from a standard-compliant implementation. Secondly, it documents your expectation of that behaviour. At the cost of about a dozen extra characters, I don't see where this discussion is coming from.

That's fair enough, I hadn't rechecked the spec.
The point about binary/text mode was a reference to Oxy's comment, which implied that it may not work on non-POSIX systems (i.e. MS Windows). My point being that it would, and in any case wouldn't (broadly speaking) make a difference in that case.