Serialization Methods

Overview

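Serialization (also called marshalling) converts a data structure or an object's state into a format that can be stored (for example in a file or in memory) or transmitted (for example over a network). The reverse operation is deserialization (unmarshalling).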
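As a minimal sketch of the round trip described above, the following uses Python's standard json module. The record and its field names are made up for illustration.

```python
import json

# Hypothetical in-memory object state; the field names are illustrative only.
record = {"id": 7, "name": "sensor-a", "readings": [1.5, 2.25, 3.0]}

# Serialization: object state -> bytes that can be stored or sent over a network.
encoded = json.dumps(record).encode("utf-8")

# Deserialization: bytes -> an equivalent object.
decoded = json.loads(encoded.decode("utf-8"))

print(type(encoded).__name__, len(encoded))
print(decoded == record)
```

Any other format discussed here could stand in for JSON; only the encode and decode calls would change.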

In 1987, the former Sun Microsystems released XDR (External Data Representation).

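In the late 1990s XML became popular. It is a text-based encoding that is easy for humans to read and understand, but it gives up the compactness of byte-stream-based binary encodings.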

JSON is a more lightweight text-based encoding, commonly used for communication between client and server.
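To make the "lightweight" claim concrete, here is a small sketch that encodes the same record as XML and as JSON using only the standard library and compares the sizes. The record and the one-element-per-field XML layout are assumptions chosen for illustration; other XML layouts (for example attributes) would give different numbers.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record used only for the comparison.
user = {"id": "42", "name": "Alice", "email": "alice@example.com"}

# Text encoding 1: XML, one child element per field, no XML declaration.
root = ET.Element("user")
for key, value in user.items():
    ET.SubElement(root, key).text = value
xml_bytes = ET.tostring(root, encoding="unicode").encode("utf-8")

# Text encoding 2: JSON, without optional whitespace.
json_bytes = json.dumps(user, separators=(",", ":")).encode("utf-8")

print("XML bytes: ", len(xml_bytes))
print("JSON bytes:", len(json_bytes))
```

For this record the JSON form is shorter, mainly because XML repeats each field name in a closing tag.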

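YAML is similar to JSON but adds more powerful features. It is designed to be easier for humans to read and is usually more compact to write.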

There is also the property list format used on Apple systems.

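Besides these and Protobuf, there are many other serialization formats, such as Thrift, Avro, BSON, CBOR, and MessagePack, as well as many encodings that are not cross-language. The gosercomp project compares Go serialization libraries, covering serialization and deserialization performance as well as the size of the serialized data. Overall, Protobuf performs well at both serialization and deserialization, and its encoded size is also reasonably small.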

Protobuf supports many languages, including C++, C#, Dart, Go, Java, Python, and Rust, and it is cross-platform, so it is widely used.

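Protobuf consists of the definition of the serialization format, libraries for various languages, and an IDL compiler. Normally you define your messages in a .proto file and then use the IDL compiler to generate code for the language you need.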

As a matter of personal habit, I usually store data files as text-based JSON or binary Protobuf, and configuration files as YAML or XML.

Protobuf

https://developers.google.com/protocol-buffers/docs/cpptutorial

Installation and removal

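# Install
sudo pip install protobuf==3.6.1
sudo apt install protobuf-compiler

# Uninstall
sudo pip uninstall protobuf
sudo apt remove protobuf-compiler
# Only if protoc was installed manually into /usr/local/bin:
sudo rm /usr/local/bin/protoc

# Check versions
protoc --version
pip list | grep -i protobuf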

Using protoc

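protoc is the official cross-language code generation and encode/decode tool. It can generate the protobuf source files for many programming languages, and it can also convert proto data between binary and text form.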

protoc --proto_path=./folder1/ --proto_path=./folder2 --python_out=./python_out/ ./folder1/a.proto ./folder2/b.proto

protoc --decode=MESSAGE_NAME --proto_path=./folder1/ --proto_path=./folder2 ./folder1/a.proto ./folder2/b.proto < binary_proto.bin

protoc --decode=[package].[Message type] proto.file < protobuf.response

protoc --encode=MESSAGE_NAME --proto_path=./folder1/ --proto_path=./folder2 ./folder1/a.proto ./folder2/b.proto

protoc --proto_path=${protobuf_path} --encode=${protobuf_message} ${protobuf_file} < ${source_file} > ${output_file}
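When these conversions need to be scripted, the same command lines can be assembled and run from Python. The sketch below is only one way to do it: build_protoc_command and run_protoc are hypothetical helper names, and the message name and paths are the placeholders from the commands above. It uses only the --decode, --encode, and --proto_path flags already shown.

```python
import subprocess
from typing import List, Sequence


def build_protoc_command(mode: str, message_name: str,
                         proto_paths: Sequence[str],
                         proto_files: Sequence[str]) -> List[str]:
    """Build the argument list for `protoc --decode=...` or `protoc --encode=...`."""
    if mode not in ("decode", "encode"):
        raise ValueError("mode must be 'decode' or 'encode'")
    cmd = ["protoc", f"--{mode}={message_name}"]
    cmd += [f"--proto_path={path}" for path in proto_paths]
    cmd += list(proto_files)
    return cmd


def run_protoc(cmd: Sequence[str], input_path: str, output_path: str) -> None:
    """Run protoc with stdin/stdout redirected, like `< input > output` in a shell."""
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        subprocess.run(list(cmd), stdin=src, stdout=dst, check=True)


# Placeholder names taken from the shell examples; nothing is executed here.
cmd = build_protoc_command(
    "decode", "MESSAGE_NAME",
    ["./folder1/", "./folder2"],
    ["./folder1/a.proto", "./folder2/b.proto"],
)
print(" ".join(cmd))
```

Passing the arguments as a list instead of a shell string avoids quoting problems when paths contain spaces.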

For a more detailed explanation of these commands, refer to the blog post.

Other uses

Streaming Multiple Messages

If you want to write multiple messages to a single file or stream, it is up to you to keep track of where one message ends and the next begins. The Protocol Buffer wire format is not self-delimiting, so protocol buffer parsers cannot determine where a message ends on their own. The easiest way to solve this problem is to write the size of each message before you write the message itself. When you read the messages back in, you read the size, then read the bytes into a separate buffer, then parse from that buffer. (If you want to avoid copying bytes to a separate buffer, check out the CodedInputStream class (in both C++ and Java) which can be told to limit reads to a certain number of bytes.)
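The size-prefix approach described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the format used by any particular protobuf helper: it assumes a fixed 4-byte big-endian length prefix, the function names write_delimited and read_delimited are hypothetical, and plain byte strings stand in for serialized messages.

```python
import io
import struct
from typing import BinaryIO, Iterator

# Assumption: every message is preceded by a 4-byte big-endian unsigned length.
_LEN = struct.Struct(">I")


def write_delimited(stream: BinaryIO, payload: bytes) -> None:
    """Write the payload size first, then the payload itself."""
    stream.write(_LEN.pack(len(payload)))
    stream.write(payload)


def read_delimited(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each payload by reading its size, then exactly that many bytes."""
    while True:
        header = stream.read(_LEN.size)
        if not header:
            return  # clean end of stream
        if len(header) < _LEN.size:
            raise EOFError("truncated length prefix")
        (size,) = _LEN.unpack(header)
        payload = stream.read(size)
        if len(payload) < size:
            raise EOFError("truncated message")
        yield payload


# Stand-ins for serialized messages; with protobuf these would be the
# bytes returned by msg.SerializeToString().
messages = [b"first", b"", b"third message"]

buf = io.BytesIO()
for m in messages:
    write_delimited(buf, m)

buf.seek(0)
recovered = list(read_delimited(buf))
print(recovered == messages)
```

Each recovered payload would then be handed to the message's parser. On a socket, read may return fewer bytes than requested, so a real reader would loop until the full size has arrived.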

Self-describing Messages

Protocol Buffers do not contain descriptions of their own types. Thus, given only a raw message without the corresponding .proto file defining its type, it is difficult to extract any useful data.

However, note that the contents of a .proto file can itself be represented using protocol buffers. The file src/google/protobuf/descriptor.proto in the source code package defines the message types involved. protoc can output a FileDescriptorSet, which represents a set of .proto files, using the --descriptor_set_out option. With this, you could define a self-describing protocol message that carries the FileDescriptorSet describing its type alongside the serialized payload.

JSON

JSON format definition

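In JSON, an object is an unordered collection of key/value pairs. An object begins with { and ends with }, each key is followed by :, and pairs are separated by ,. An array is an ordered collection of values that begins with [ and ends with ], with values separated by ,. A value can be a string, a number, true, false, or null, or an object or an array, and these structures can be nested. See the official definition for more details.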
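The value kinds listed above map onto Python types when parsed with the standard json module. The document below is a made-up example that uses each kind once, including nesting.

```python
import json

# Hypothetical JSON text exercising every kind of value.
text = """
{
  "name": "demo",
  "count": 3,
  "ratio": 0.5,
  "ok": true,
  "off": false,
  "none": null,
  "tags": ["a", "b"],
  "nested": {"k": [1, {"deep": null}]}
}
"""

doc = json.loads(text)

# object -> dict, array -> list, string -> str, number -> int or float,
# true/false -> bool, null -> None
for key, value in doc.items():
    print(key, type(value).__name__)
```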

Read/write examples

Python

Reading a file:

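import json

# Path to the JSON file to read
json_file = "./demo.json"
# Open with an explicit encoding so non-ASCII text decodes consistently
with open(json_file, encoding='utf-8') as f:
  data = json.load(f)  # parse the file into Python objects (dict, list, str, ...)
print(data)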
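
Writing a file works the same way in reverse. A minimal sketch, assuming a made-up data dictionary and a temporary output path (demo_out.json is a hypothetical file name):

```python
import json
import os
import tempfile

# Hypothetical data to store; any JSON-compatible structure works.
data = {"title": "示例", "values": [1, 2, 3], "enabled": True}

# Write into a temporary directory so the sketch does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "demo_out.json")
with open(out_path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps non-ASCII text readable in the file;
    # indent=2 pretty-prints the output.
    json.dump(data, f, ensure_ascii=False, indent=2)

# Read the file back to confirm the round trip.
with open(out_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(reloaded == data)
```

Without ensure_ascii=False, json.dump escapes non-ASCII characters as \uXXXX sequences, which still loads correctly but is harder to read in an editor.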