I recently came across some good material, so here are some quick reading notes.
http://www.zefhemel.com/archives/2004/09/01/the-share-nothing-architecture
Scalability is “the capability of a system to increase performance under an increased load when resources (typically hardware) are added.” (source: Wikipedia)
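The first paragraph explains what scalability means. Scalability is not the same as performance: if doubling your server investment doubles performance, and the business keeps running without interruption while servers are being added, that is scalability.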
A big scalability problem with caching data is called the cache-coherence problem.
The second paragraph explains a problem that seriously hurts scalability: cache coherence.
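The third paragraph pitches the shared-nothing architecture: at least on the web servers, share no data at all. Sessions can be maintained centrally through files (accessed remotely over NFS) or through a database, and databases scale fairly well (this is what eBay does).
Given that this is a post from 2004, I won't say much more; the author's experience at the time was still fairly shallow.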
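A minimal sketch of the shared-nothing idea above, in Python. The `SessionStore` and `WebServer` classes are hypothetical names invented for illustration; the in-memory dict only stands in for the central database or NFS-backed files the post describes.

```python
import uuid


class SessionStore:
    """Stand-in for the central session store (a database or NFS-backed files)."""

    def __init__(self):
        self._rows = {}

    def load(self, session_id):
        return dict(self._rows.get(session_id, {}))

    def save(self, session_id, data):
        self._rows[session_id] = dict(data)


class WebServer:
    """A web server that keeps no per-user state in its own memory."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id, item=None):
        # Load state from the central store on every request, never from local memory.
        session = self.store.load(session_id)
        cart = session.get("cart", [])
        if item is not None:
            cart = cart + [item]
        session["cart"] = cart
        self.store.save(session_id, session)
        return cart


store = SessionStore()
web1 = WebServer("web1", store)
web2 = WebServer("web2", store)

sid = uuid.uuid4().hex
web1.handle(sid, "book")
cart = web2.handle(sid, "pen")  # a different server sees the same session
```

Because neither server holds session state, a load balancer can send each request to any of them, and servers can be added or removed without migrating sessions.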
=====================================
http://wutaoo.javaeye.com/blog/148369
from:http://highscalability.com/sharding-hibernate-way
Shard: splitting up your data sets. If your data doesn’t fit on one machine you split it up into pieces and each piece is called a shard.
Sharding: the process of splitting up data.
Shards is for situations where you have too much data to fit in a single database. MySQL partitioning may allow you to delay when you need to shard, but it is still a single database and you’ll eventually run into limits.
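The difference between sharding and partitioning: partitioning happens inside a single DB, whereas sharding is an approach to dividing data that is mostly about distributing it across multiple DB servers.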
Control over how data are distributed is determined by a pluggable strategies layer.
— This is how Hibernate implements it
Plan for the future by picking a strategy that will last you a long time.
Repartitioning/resharding the data is operationally very difficult. No management tools for this yet
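— The choice of sharding strategy matters a great deal; undoing a bad choice later is painful
The rest of the article covers the limitations and features of this strategy layer.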
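To make the "pluggable strategy" idea and the cost of resharding concrete, here is a small Python sketch. It is not the Hibernate Shards API; `modulo_strategy`, `range_strategy`, `ShardRouter`, and `moved_fraction` are hypothetical names, and integer user ids are assumed.

```python
def modulo_strategy(num_shards):
    """Return a strategy that maps an integer key to shard `key % num_shards`."""
    def pick(key):
        return key % num_shards
    return pick


def range_strategy(boundaries):
    """Return a strategy for integer keys: shard i holds keys below boundaries[i]."""
    def pick(key):
        for shard, upper in enumerate(boundaries):
            if key < upper:
                return shard
        return len(boundaries)  # the last shard takes everything else
    return pick


class ShardRouter:
    """Routes keys to shards through whichever strategy was plugged in."""

    def __init__(self, strategy):
        self.strategy = strategy

    def shard_for(self, key):
        return self.strategy(key)


def moved_fraction(old, new, keys):
    """Fraction of keys that land on a different shard after a strategy change."""
    moved = sum(1 for k in keys if old(k) != new(k))
    return moved / len(keys)


keys = range(10_000)

# Growing a modulo layout from 4 to 5 shards relocates most keys.
modulo_moved = moved_fraction(modulo_strategy(4), modulo_strategy(5), keys)

# Appending a boundary to a range layout leaves these keys where they were.
range_moved = moved_fraction(
    range_strategy([1000, 5000]), range_strategy([1000, 5000, 20000]), keys
)
```

In this toy setup the modulo layout moves 80% of the keys when one shard is added, while the range layout moves none, which is one way to see why the strategy should be chosen with future growth in mind.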
=====================================
http://www.mysqlperformanceblog.com/2008/03/14/sharding-and-time-base-partitioning/
However this may be not the most optimal approach by itself because not all data belonging to same user is equal.
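— Explains that the shard model does not suit every type of data, especially when the data differs in importance or in how hot it is
— The author suggests that some data can, on top of sharding, be further partitioned by time or by hotness
— The author assumes that once you shard you no longer need a master-slave architecture. In fact, large systems still need master-slave to make each node more robust. A 1 master : n slaves setup can also raise the maximum utilization of a single node, and by sending the heavy queries of non-critical features to one of the slaves you keep non-critical work from affecting critical work.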
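The suggestion of layering time-based partitioning on top of sharding can be sketched as a two-level lookup. This is an illustration only: the shard count, the `messages_YYYY_MM` table naming, and the function names are assumptions, not taken from the article.

```python
from datetime import date

NUM_SHARDS = 4  # assumed shard count


def locate(user_id, created):
    """Return (shard index, table name) for a row owned by `user_id`."""
    shard = user_id % NUM_SHARDS          # first level: shard by user
    table = f"messages_{created:%Y_%m}"   # second level: one table per month
    return shard, table


def tables_for_range(start, end):
    """List the monthly tables that a query over [start, end] has to touch."""
    tables = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        tables.append(f"messages_{year:04d}_{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return tables


where = locate(42, date(2008, 3, 14))
recent = tables_for_range(date(2007, 11, 20), date(2008, 2, 1))
```

With this layout a query for a user's recent data touches only the newest monthly tables on that user's shard, and old months can be archived without touching hot data.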
http://mysqldba.blogspot.com/2006/11/unorthodox-approach-to-database-design.html
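— Friendster's architecture
— Each 64bit AMD server would house 500K distinct users and all their data
— Organized around the user: all of a user's related information is stored together
— Mentions several replication problems; of these, “IO bandwidth is low replication lags, causing slave lag” is the one we most need to consider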
=====================================
http://highscalability.com/unorthodox-approach-database-design-coming-shard
Describes the advantages of sharding
a. High availability
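— An extra benefit of sharding is partial availability: the service can keep running for part of the data even when some of it is unavailable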
b. Faster queries.
c. More write bandwidth
d. You can do more work
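— Concurrent throughput increases
How sharding differs from a traditional architecture:
a. Data are denormalized
— Denormalized design: data sharing the same primary key is stored together (the first time I have seen it put this way)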
— You can keep a user’s profile data separate from their comments, blogs, email, media, etc, but the user profile data would be stored and retrieved as a whole
b. Data are parallelized across many physical instances
c. Data are kept small
d. Data are more highly available
— You can also setup a shard to have a master-slave or dual master relationship within the shard to avoid a single point of failure within the shard
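e. It doesn’t use replication
— Data is split or moved by means other than replication; this clears up a common misunderstanding of sharding
Scaling purely through replication runs into a write bottleneck (LiveJournal had painful experience with this)
Problems with sharding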
1. Rebalancing data
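— If some users' data grows too large, the data volume across nodes has to be rebalanced, which is a painful process (Google has automatic rebalancing). Flickr has a global name service to locate where data lives. For a traditional architecture that routes by modulo, this problem has to be solved sooner or later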
— And your references must be invalidateable so the underlying data can be moved while you are using it.
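— My understanding of this: each data source also holds related reference data (for example, a friend's nick). This reference data reduces dependence on the primary (base) data, so some features can keep working even while the base data is being migrated.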
2. Joining data from multiple shards
You have to make individual requests to your data sources, get all the responses, and then build the page
— Amazon uses a parallel query mechanism to speed up queries; this is worth learning from
3. How do you partition your data in shards?
— Unfortunately there are no easy answers to these questions.
— Indeed, no two businesses are exactly alike; you have to weigh this against your own business
4. Less leverage
— There is little literature on this; most of the time “you are on your own”
5. Implementing shards is not well supported
— You have to walk the road yourself and build the tools yourself
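Two of the problems above, locating data through a global lookup (item 1) and assembling a page from several shards (item 2), can be sketched together in Python. This is a toy model under stated assumptions, not Flickr's name service or Amazon's query mechanism: `SHARDS`, `DIRECTORY`, and all function names are hypothetical, and in-memory dicts stand in for real databases.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory shards: shard id -> {user_id: profile}.
SHARDS = {
    0: {1: {"nick": "alice"}, 4: {"nick": "dave"}},
    1: {2: {"nick": "bob"}},
    2: {3: {"nick": "carol"}},
}

# Global directory (the "name service"): user_id -> shard id.
DIRECTORY = {1: 0, 4: 0, 2: 1, 3: 2}


def fetch_from_shard(shard_id, user_ids):
    """Stand-in for one query sent to one shard."""
    rows = SHARDS[shard_id]
    return {uid: rows[uid] for uid in user_ids if uid in rows}


def fetch_profiles(user_ids):
    """Group ids by shard, query the shards in parallel, merge the results."""
    by_shard = defaultdict(list)
    for uid in user_ids:
        by_shard[DIRECTORY[uid]].append(uid)
    merged = {}
    with ThreadPoolExecutor(max_workers=len(by_shard) or 1) as pool:
        futures = [
            pool.submit(fetch_from_shard, shard_id, ids)
            for shard_id, ids in by_shard.items()
        ]
        for future in futures:
            merged.update(future.result())
    return merged


def move_user(user_id, target_shard):
    """Rebalance one user: copy the data, repoint the directory, drop the old copy."""
    source_shard = DIRECTORY[user_id]
    if source_shard == target_shard:
        return
    SHARDS[target_shard][user_id] = SHARDS[source_shard][user_id]
    DIRECTORY[user_id] = target_shard
    del SHARDS[source_shard][user_id]


page_data = fetch_profiles([1, 2, 3])
move_user(1, 2)
after_move = fetch_profiles([1])
```

Because callers go through the directory instead of computing the shard from the key, a single user can be moved without remapping anyone else; the price is an extra lookup and a directory that must itself be kept available.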
=====================================
http://blog.gigaspaces.com//2007/04/06/shared-nothing-architecture-redefined/
— Today we have better ways to remove data dependency, without putting the data into a shared file system — which may eventually become a bottleneck. We partition it and store it in-memory
— GigaSpaces often boasts about its space-based architecture, which in essence seems much the same as SNA (shared-nothing architecture).
http://en.wikipedia.org/wiki/Shard
–database sharding is a method for database partitioning which involves partitioning
across multiple servers in a shared nothing architecture.
–The term shard seems to come from MMOGs